Validating Pull Requests on GitHub

library(hubValidations)

Running validation checks on a Pull Request with `validate_pr()`

The validate_pr() function is designed to validate team submissions made through Pull Requests on GitHub. Only model output and model metadata files are individually validated, using validate_submission() or validate_model_metadata() respectively, according to file type. See the end of this article for details of the standard checks performed on each file; for information on deploying optional or custom functions, see the article on including custom functions (vignette("custom-functions")). Hub config files are also validated as part of the checks. Any other files included in the PR are ignored but flagged in a message.
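As a rough illustration of the routing described above, the sketch below sorts changed file paths into the groups that are validated individually and those that are only flagged. It is not the package's implementation: classify_pr_file() is a hypothetical helper, and the directory names ("model-output", "model-metadata") and accepted extensions are assumptions based on common hub defaults.

```r
# Hypothetical helper: decide how a file changed in a PR would be treated.
# Assumes default directory names and csv/parquet/arrow model output formats.
classify_pr_file <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  ext <- tolower(tools::file_ext(path))
  if (parts[1] == "model-output" && length(parts) == 3 &&
      ext %in% c("csv", "parquet", "arrow")) {
    "model output"
  } else if (parts[1] == "model-metadata" && length(parts) == 2 &&
             ext %in% c("yml", "yaml")) {
    "model metadata"
  } else {
    "not individually checked"
  }
}

changed <- c(
  "model-output/hub-baseline/2022-10-08-hub-baseline.csv",
  "model-metadata/team1-goodmodel.yaml",
  "model-metadata/README.md",
  "random-file.txt"
)

# One label per changed file, then the files grouped by label
kinds <- vapply(changed, classify_pr_file, character(1), USE.NAMES = FALSE)
split(changed, kinds)
```

The last two paths end up in the "not individually checked" group, which mirrors the kind of files validate_pr() lists in its informational message.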

Deploying `validate_pr()` through a GitHub Action workflow

The most common way to deploy validate_pr() is through a GitHub Action that triggers when a pull request containing changes to model output or model metadata files is opened. The hubverse maintains the validate-submission.yaml GitHub Action workflow template for deploying validate_pr().

The latest release of the workflow can be added to a hub’s GitHub Action workflows using the hubCI package:

hubCI::use_hub_github_action("validate-submission")

The pertinent section of the workflow is:

      - name: Run validations
        env:
          PR_NUMBER: ${{ github.event.number }}
        run: |
          library("hubValidations")
          v <- hubValidations::validate_pr(
              gh_repo = Sys.getenv("GITHUB_REPOSITORY"),
              pr_number = Sys.getenv("PR_NUMBER"),
              skip_submit_window_check = FALSE
          )
          hubValidations::check_for_errors(v, verbose = TRUE)
        shell: Rscript {0}

Here, validate_pr() is called on the contents of the current Pull Request, the result (an S3 <hub_validations> class object) is stored in v, and check_for_errors() is then used to signal whether the overall validation has passed or failed and to summarise any validation failures.
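To reproduce this step outside GitHub Actions, one option is to wrap the same two calls in a small function that reads the environment variables and fails early if they are missing. This is only a sketch: run_pr_validation() is a hypothetical helper, not part of hubValidations, and it assumes the hub has been cloned to hub_path.

```r
# Hypothetical wrapper around the two calls used in the workflow step.
run_pr_validation <- function(hub_path = ".") {
  gh_repo <- Sys.getenv("GITHUB_REPOSITORY")
  pr_number <- Sys.getenv("PR_NUMBER")
  # Sys.getenv() returns "" for unset variables, so check before validating
  if (!nzchar(gh_repo) || !nzchar(pr_number)) {
    stop("Both GITHUB_REPOSITORY and PR_NUMBER must be set.", call. = FALSE)
  }
  v <- hubValidations::validate_pr(
    hub_path = hub_path,
    gh_repo = gh_repo,
    pr_number = pr_number,
    skip_submit_window_check = FALSE
  )
  hubValidations::check_for_errors(v, verbose = TRUE)
}
```

For a local run, the variables could be set first with Sys.setenv(GITHUB_REPOSITORY = "owner/repo", PR_NUMBER = "1"), where both values are placeholders.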

Here’s an example of what the workflow looks like on GitHub:

Skipping submission window checks

Most hubs require that model output files for a given round are submitted within a submission window defined in the "submission_due" property of the tasks.json hub config file.

validate_pr() includes submission window checks for model output files and returns a <warning/check_failure> condition class object if a file is submitted outside the accepted submission window.

To disable submission window checks, set the skip_submit_window_check argument to TRUE.

Configuring file modification/deletion/renaming checks

For most hubs, modification, renaming or deletion of previously submitted model output files or deletion/renaming of previously submitted model metadata files is not desirable without justification. They should therefore trigger validation failure and notify hub maintainers of the files affected. At the same time, most hubs prefer to allow modifications to model output files within their allowed submission window.

Reflecting these preferences, by default, validate_pr() checks for modification, renaming or deletion of previously submitted model output files and deletion/renaming of previously submitted model metadata files, and appends an <error/check_error> class object to the output for each file modification/deletion/renaming detected. It does, however, allow modifications to model output files within their allowed submission window.

temp_hub <- fs::path(tempdir(), "mod_del_hub")
gert::git_clone(
  url = "https://github.com/hubverse-org/ci-testhub-simple",
  path = temp_hub,
  branch = "test-mod-del"
)

v <- validate_pr(
  hub_path = temp_hub,
  gh_repo = "hubverse-org/ci-testhub-simple",
  pr_number = 6,
  skip_submit_window_check = TRUE
)
#> ℹ PR contains commits to additional files which have not been checked:
#> • ".github/workflows/validate_submission.yaml"
#> • "README.md"
#> • "hub-config/admin.json"
#> • "hub-config/model-metadata-schema.json"
#> • "hub-config/tasks.json"
#> • "model-metadata/README.md"
#> • "model-output/hub-baseline/README.txt"
#> • "random-file.txt"


v
#> ✔ mod_del_hub: All hub config files are valid.
#> ✖ 2022-10-08-hub-baseline.csv: Previously submitted model output files must not
#>   be modified.  model-output/hub-baseline/2022-10-08-hub-baseline.csv modified.
#> ✖ 2022-10-15-team1-goodmodel.csv: Previously submitted model output files must
#>   not be removed.  model-output/team1-goodmodel/2022-10-15-team1-goodmodel.csv
#>   removed.
#> ✖ team1-goodmodel.yaml: Previously submitted model metadata files must not be
#>   removed.  model-metadata/team1-goodmodel.yaml removed.
#> ✔ 2022-10-08-hub-baseline.csv: File exists at path
#>   model-output/hub-baseline/2022-10-08-hub-baseline.csv.
#> ✔ 2022-10-08-hub-baseline.csv: File name "2022-10-08-hub-baseline.csv" is
#>   valid.
#> ✔ 2022-10-08-hub-baseline.csv: File directory name matches `model_id` metadata
#>   in file name.
#> ✔ 2022-10-08-hub-baseline.csv: `round_id` is valid.
#> ✔ 2022-10-08-hub-baseline.csv: File is accepted hub format.
#> ✔ 2022-10-08-hub-baseline.csv: Metadata file exists at path
#>   model-metadata/hub-baseline.yml.
#> ✔ 2022-10-08-hub-baseline.csv: File could be read successfully.
#> ✔ 2022-10-08-hub-baseline.csv: `round_id_col` name is valid.
#> ✔ 2022-10-08-hub-baseline.csv: `round_id` column "origin_date" contains a
#>   single, unique round ID value.
#> ✔ 2022-10-08-hub-baseline.csv: All `round_id_col` "origin_date" values match
#>   submission `round_id` from file name.
#> ✔ 2022-10-08-hub-baseline.csv: Column names are consistent with expected round
#>   task IDs and std column names.
#> ✔ 2022-10-08-hub-baseline.csv: Column data types match hub schema.
#> ✔ 2022-10-08-hub-baseline.csv: `tbl` contains valid values/value combinations.
#> ✔ 2022-10-08-hub-baseline.csv: All combinations of task ID
#>   column/`output_type`/`output_type_id` values are unique.
#> ✔ 2022-10-08-hub-baseline.csv: Required task ID/output type/output type ID
#>   combinations all present.
#> ✔ 2022-10-08-hub-baseline.csv: Values in column `value` all valid with respect
#>   to modeling task config.
#> ✔ 2022-10-08-hub-baseline.csv: Values in `value` column are non-decreasing as
#>   output_type_ids increase for all unique task ID value/output type
#>   combinations of quantile or cdf output types.
#> ℹ 2022-10-08-hub-baseline.csv: No pmf output types to check for sum of 1. Check
#>   skipped.
#> ℹ 2022-10-08-hub-baseline.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> ℹ 2022-10-08-hub-baseline.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> ℹ 2022-10-08-hub-baseline.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> ✔ 2022-10-22-team1-goodmodel.csv: File exists at path
#>   model-output/team1-goodmodel/2022-10-22-team1-goodmodel.csv.
#> ✔ 2022-10-22-team1-goodmodel.csv: File name "2022-10-22-team1-goodmodel.csv" is
#>   valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: File directory name matches `model_id`
#>   metadata in file name.
#> ✔ 2022-10-22-team1-goodmodel.csv: `round_id` is valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: File is accepted hub format.
#> ✖ 2022-10-22-team1-goodmodel.csv: Metadata file does not exist at path
#>   model-metadata/team1-goodmodel.yml or model-metadata/team1-goodmodel.yaml.

These settings can be modified if required through the arguments file_modification_check and allow_submit_window_mods.

file_modification_check controls whether modification/deletion checks are performed and what is returned if modifications/deletions are detected. It accepts one of the following values:
- "error": Appends an <error/check_error> condition class object for each applicable modified/deleted file. Will result in validation workflow failure.
- "warning": Appends a <warning/check_warning> condition class object for each applicable modified/deleted file. Will result in validation workflow failure.
- "message": Appends a <message/check_info> condition class object for each applicable modified/deleted file. Will not result in validation workflow failure.
- "none": No modification/deletion checks performed.

allow_submit_window_mods controls whether modifications/deletions of model output files are allowed within their submission windows. It is set to TRUE by default but can be set to FALSE if modifications/deletions are not allowed, regardless of timing. It is ignored when checking model metadata files and when file_modification_check is set to "none".
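As a minimal sketch of how these two arguments might be combined, the hypothetical wrapper below (validate_pr_with_policy() is not part of hubValidations) restricts the policy to the four values listed above before passing it on, and disallows modifications regardless of submission window timing.

```r
# Hypothetical wrapper: pick a file modification policy and forbid
# modifications even inside the submission window.
validate_pr_with_policy <- function(hub_path, gh_repo, pr_number,
                                    policy = c("error", "warning",
                                               "message", "none")) {
  # match.arg() rejects anything other than the four documented values
  policy <- match.arg(policy)
  hubValidations::validate_pr(
    hub_path = hub_path,
    gh_repo = gh_repo,
    pr_number = pr_number,
    file_modification_check = policy,
    allow_submit_window_mods = FALSE
  )
}
```

With policy = "message", modifications and deletions would still be reported but, as described above, would not fail the validation workflow.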

Warning

Note that when modification/deletion checks are performed with allow_submit_window_mods set to TRUE, relative submission windows are established using the round_id extracted from the file path as the reference date. This is because dates cannot be extracted from the columns of deleted files. If a hub's submission window reference dates do not match the round IDs in file paths, allow_submit_window_mods will currently not work correctly and is best set to FALSE. This only applies to hubs/rounds where submission windows are determined relative to a reference date, not where explicit submission window start and end dates are provided in the config.

For more details on submission window config see Setting up "submission_due" in the hubverse hubDocs.
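To make the warning above concrete, the following base R sketch derives a reference date from a file path and tests whether a given day falls inside a relative window. It is a simplification and not the package's logic: both helpers are hypothetical, the start and end offsets (in days) are made-up values, and time zones and times of day are ignored.

```r
# Hypothetical helper: take the leading YYYY-MM-DD of the file name as round ID
round_id_from_path <- function(file_path) {
  file_name <- basename(file_path)
  m <- regexpr("^[0-9]{4}-[0-9]{2}-[0-9]{2}", file_name)
  if (m == -1) {
    return(NA_character_)
  }
  regmatches(file_name, m)
}

# Hypothetical helper: is `today` within [ref + start, ref + end] (in days)?
in_relative_window <- function(today, round_id, start, end) {
  ref_date <- as.Date(round_id)
  today >= ref_date + start && today <= ref_date + end
}

path <- "model-output/hub-baseline/2022-10-08-hub-baseline.csv"
round_id <- round_id_from_path(path)

# Made-up window: opens 6 days before and closes 1 day after the round ID date
inside <- in_relative_window(as.Date("2022-10-05"), round_id, start = -6L, end = 1L)
outside <- in_relative_window(as.Date("2022-10-20"), round_id, start = -6L, end = 1L)
```

If the hub's real reference date differed from the date in the file name, a calculation of this kind would place the window in the wrong position, which is why setting allow_submit_window_mods to FALSE is suggested in that situation.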

Checking for validation failures with `check_for_errors()`

check_for_errors() is used to inspect a hub_validations class object, determine whether overall validations have passed or failed and summarise any detected errors/failures.

Validation failure

If any elements of the hub_validations object contain <error/check_error>, <warning/check_warning> or <error/check_exec_error> condition class objects, the function throws an error and prints the messages from the failing checks.

temp_hub <- fs::path(tempdir(), "invalid_sb_hub")
gert::git_clone(
  url = "https://github.com/hubverse-org/ci-testhub-simple",
  path = temp_hub,
  branch = "pr-missing-taskid"
)

v_fail <- validate_pr(
  hub_path = temp_hub,
  gh_repo = "hubverse-org/ci-testhub-simple",
  pr_number = 5,
  skip_submit_window_check = TRUE
)
#> ℹ PR contains commits to additional files which have not been checked:
#> • ".github/workflows/validate_submission.yaml"
#> • "hub-config/admin.json"
#> • "hub-config/model-metadata-schema.json"
#> • "hub-config/tasks.json"


check_for_errors(v_fail)
#> ✖ 2022-10-22-hub-baseline.parquet: Column names must be consistent with
#>   expected round task IDs and std column names.  Expected column "age_group"
#>   not present in file.
#> Error in `check_for_errors()`:
#> ! 
#> The validation checks produced some failures/errors reported above.
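The pass/fail rule described at the start of this section can be pictured with plain lists and classes. The sketch below uses mock objects rather than a real hub_validations object; make_check() and is_failing() are hypothetical helpers, the "check_success" class name is an assumption, and the code is not how check_for_errors() is implemented.

```r
# Hypothetical constructor for a mock check result with a given class
make_check <- function(message, class) {
  structure(list(message = message), class = c(class, "condition"))
}

# Classes named in the text above as causing overall validation failure
failing_classes <- c("check_error", "check_warning", "check_exec_error")

# inherits() returns TRUE if the object has any of the listed classes
is_failing <- function(x) inherits(x, failing_classes)

mock_results <- list(
  file_exists = make_check("File exists at path.", "check_success"),
  colnames = make_check("Expected column not present in file.", "check_error")
)

failed <- vapply(mock_results, is_failing, logical(1))
names(mock_results)[failed]
```

Here only the colnames entry is flagged, so a summary built this way would report that single check and signal an overall failure.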

Validation success

If all validation checks pass, check_for_errors() returns TRUE invisibly and prints:

✔ All validation checks have been successful.

temp_hub <- fs::path(tempdir(), "valid_sb_hub")
gert::git_clone(
  url = "https://github.com/hubverse-org/ci-testhub-simple",
  path = temp_hub,
  branch = "pr-valid"
)

v_pass <- validate_pr(
  hub_path = temp_hub,
  gh_repo = "hubverse-org/ci-testhub-simple",
  pr_number = 4,
  skip_submit_window_check = TRUE
)
#> ℹ PR contains commits to additional files which have not been checked:
#> • ".github/workflows/validate_submission.yaml"
#> • "hub-config/admin.json"
#> • "hub-config/model-metadata-schema.json"
#> • "hub-config/tasks.json"


check_for_errors(v_pass)
#> ✔ All validation checks have been successful.

Verbose output

If printing the results of all checks is preferred instead of just summarising the results of checks that failed, argument verbose can be set to TRUE.

check_for_errors(v_fail, verbose = TRUE)
#> 
#> ── Individual check results ──
#> 
#> ✔ invalid_sb_hub: All hub config files are valid.
#> ✔ 2022-10-22-hub-baseline.parquet: File exists at path
#>   model-output/hub-baseline/2022-10-22-hub-baseline.parquet.
#> ✔ 2022-10-22-hub-baseline.parquet: File name "2022-10-22-hub-baseline.parquet"
#>   is valid.
#> ✔ 2022-10-22-hub-baseline.parquet: File directory name matches `model_id`
#>   metadata in file name.
#> ✔ 2022-10-22-hub-baseline.parquet: `round_id` is valid.
#> ✔ 2022-10-22-hub-baseline.parquet: File is accepted hub format.
#> ✔ 2022-10-22-hub-baseline.parquet: Metadata file exists at path
#>   model-metadata/hub-baseline.yml.
#> ✔ 2022-10-22-hub-baseline.parquet: File could be read successfully.
#> ✔ 2022-10-22-hub-baseline.parquet: `round_id_col` name is valid.
#> ✔ 2022-10-22-hub-baseline.parquet: `round_id` column "origin_date" contains a
#>   single, unique round ID value.
#> ✔ 2022-10-22-hub-baseline.parquet: All `round_id_col` "origin_date" values
#>   match submission `round_id` from file name.
#> ✖ 2022-10-22-hub-baseline.parquet: Column names must be consistent with
#>   expected round task IDs and std column names.  Expected column "age_group"
#>   not present in file.
#> 
#> ── Overall validation result ───────────────────────────────────────────────────
#> ✖ 2022-10-22-hub-baseline.parquet: Column names must be consistent with
#>   expected round task IDs and std column names.  Expected column "age_group"
#>   not present in file.
#> Error in `check_for_errors()`:
#> ! 
#> The validation checks produced some failures/errors reported above.



check_for_errors(v_pass, verbose = TRUE)
#> 
#> ── Individual check results ──
#> 
#> ✔ valid_sb_hub: All hub config files are valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: File exists at path
#>   model-output/team1-goodmodel/2022-10-22-team1-goodmodel.csv.
#> ✔ 2022-10-22-team1-goodmodel.csv: File name "2022-10-22-team1-goodmodel.csv" is
#>   valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: File directory name matches `model_id`
#>   metadata in file name.
#> ✔ 2022-10-22-team1-goodmodel.csv: `round_id` is valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: File is accepted hub format.
#> ✔ 2022-10-22-team1-goodmodel.csv: Metadata file exists at path
#>   model-metadata/team1-goodmodel.yaml.
#> ✔ 2022-10-22-team1-goodmodel.csv: File could be read successfully.
#> ✔ 2022-10-22-team1-goodmodel.csv: `round_id_col` name is valid.
#> ✔ 2022-10-22-team1-goodmodel.csv: `round_id` column "origin_date" contains a
#>   single, unique round ID value.
#> ✔ 2022-10-22-team1-goodmodel.csv: All `round_id_col` "origin_date" values match
#>   submission `round_id` from file name.
#> ✔ 2022-10-22-team1-goodmodel.csv: Column names are consistent with expected
#>   round task IDs and std column names.
#> ✔ 2022-10-22-team1-goodmodel.csv: Column data types match hub schema.
#> ✔ 2022-10-22-team1-goodmodel.csv: `tbl` contains valid values/value
#>   combinations.
#> ✔ 2022-10-22-team1-goodmodel.csv: All combinations of task ID
#>   column/`output_type`/`output_type_id` values are unique.
#> ✔ 2022-10-22-team1-goodmodel.csv: Required task ID/output type/output type ID
#>   combinations all present.
#> ✔ 2022-10-22-team1-goodmodel.csv: Values in column `value` all valid with
#>   respect to modeling task config.
#> ✔ 2022-10-22-team1-goodmodel.csv: Values in `value` column are non-decreasing
#>   as output_type_ids increase for all unique task ID value/output type
#>   combinations of quantile or cdf output types.
#> ℹ 2022-10-22-team1-goodmodel.csv: No pmf output types to check for sum of 1.
#>   Check skipped.
#> ℹ 2022-10-22-team1-goodmodel.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> ℹ 2022-10-22-team1-goodmodel.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> ℹ 2022-10-22-team1-goodmodel.csv: No v3 samples found in model output data to
#>   check. Skipping `skip_v3_spl_check` check.
#> 
#> ── Overall validation result ───────────────────────────────────────────────────
#> ✔ All validation checks have been successful.

`validate_pr` check details

For details on the structure of <hub_validations> objects, including on how to access more information about specific checks, see vignette("hub-validations-class").

Checks on model output files

Details of checks performed by `validate_submission()` on model output files.
Name	Check	Early return	Fail output	Extra info
valid_config	Hub config valid	TRUE	check_error
submission_time	Current time within file submission window	FALSE	check_failure
file_exists	File exists at `file_path` provided	TRUE	check_error
file_name	File name valid	TRUE	check_error
file_location	File located in correct team directory	FALSE	check_failure
round_id_valid	File round ID is a valid hub round ID	TRUE	check_error
file_format	File format is accepted hub/round format	TRUE	check_error
metadata_exists	Model metadata file exists in expected location	TRUE	check_error
file_read	File can be read without errors	TRUE	check_error
valid_round_id_col	Round ID var from config exists in data column names. Skipped if `round_id_from_var` is FALSE in config.	FALSE	check_failure
unique_round_id	Round ID column contains a single unique round ID. Skipped if `round_id_from_var` is FALSE in config.	TRUE	check_error
match_round_id	Round ID from file contents matches round ID from file name. Skipped if `round_id_from_var` is FALSE in config.	TRUE	check_error
colnames	File column names match expected column names for round (i.e. task ID names + hub standard column names)	TRUE	check_error
col_types	File column types match expected column types from config. Mainly applicable to parquet & arrow files.	FALSE	check_failure
valid_vals	Columns (excluding `value` column) contain valid combinations of task ID / output type / output type ID values	TRUE	check_error	error_tbl: table of invalid task ID/output type/output type ID value combinations
rows_unique	Columns (excluding `value` column) contain unique combinations of task ID / output type / output type ID values	FALSE	check_failure
req_vals	Columns (excluding `value` column) contain all required combinations of task ID / output type / output type ID values	FALSE	check_failure	missing_df: table of missing task ID/output type/output type ID value combinations
value_col_valid	Values in `value` column are coercible to data type configured for each output type	FALSE	check_failure
value_col_non_desc	Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID /output type value combinations. Applies to `quantile` or `cdf` output types only	FALSE	check_failure	error_tbl: table of rows affected
value_col_sum1	Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1.	FALSE	check_failure	error_tbl: table of rows affected
spl_compound_tid	Samples contain single unique values for each compound task ID within individual samples (v3 and above schema only).	TRUE	check_error	errors: list containing item for each sample failing validation with breakdown of unique values for each compound task ID.
spl_non_compound_tid	Samples contain single unique combination of non-compound task ID values across all samples (v3 and above schema only).	TRUE	check_error	errors: list containing item for each modeling task with vectors of output type ids of samples failing validation and example table of most frequent non-compound task ID value combination across all samples in the modeling task.
spl_n	Number of samples for a given compound idx falls within accepted compound task range (v3 and above schema only).	FALSE	check_failure	errors: list containing item for each compound_idx failing validation with sample count, metadata on expected samples and example table of expected structure for samples belonging to the compound idx in question.
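
Two of the value checks in the table (value_col_non_desc and value_col_sum1) describe simple numerical properties. The base R sketch below shows what those properties mean for a single task ID combination. The helper names and the tolerance are assumptions made for illustration and do not reproduce the package's implementation.

```r
# Hypothetical helper: values must not decrease as output type IDs increase
is_non_decreasing <- function(values, output_type_ids) {
  ordered <- values[order(output_type_ids)]
  all(diff(ordered) >= 0)
}

# Hypothetical helper: pmf values must sum to 1, up to a small tolerance
sums_to_one <- function(probs, tolerance = 1e-8) {
  isTRUE(all.equal(sum(probs), 1, tolerance = tolerance))
}

quantile_levels <- c(0.25, 0.5, 0.75)

ok_quantiles <- is_non_decreasing(c(10, 12, 15), quantile_levels)
bad_quantiles <- is_non_decreasing(c(10, 9, 15), quantile_levels)

ok_pmf <- sums_to_one(c(0.2, 0.3, 0.5))
bad_pmf <- sums_to_one(c(0.2, 0.3, 0.4))
```

In the real checks, these properties are evaluated for every unique task ID combination in the submission, and the affected rows are returned in error_tbl.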

Checks on model metadata files

Details of checks performed by `validate_model_metadata()` on model metadata files.
Name	Check	Early return	Fail output
metadata_schema_exists	A model metadata schema file exists in `hub-config` directory.	TRUE	check_error
metadata_file_exists	A file with name provided to argument `file_path` exists at the expected location (the `model-metadata` directory).	TRUE	check_error
metadata_file_ext	The metadata file has correct extension (yaml or yml).	TRUE	check_error
metadata_file_location	The metadata file has been saved to correct location.	TRUE	check_failure
metadata_matches_schema	The contents of the metadata file match the hub’s model metadata schema	TRUE	check_error
metadata_file_name	The metadata filename matches the model ID specified in the contents of the file.	TRUE	check_error

Custom checks

The standard checks discussed here are the checks deployed by default by the validate_pr function. For more information on deploying optional or custom functions please check the article on including custom functions (vignette("custom-functions")).
