Debugging and Fixing CRAN's 'Additional Checks' errors
Jan 17, 2020
Rich FitzJohn
12 minute read
Tags:

R packages that are published on CRAN are tested every night on a variety of platforms and on the development version of R to ensure that they continue to work. In addition, packages that contain compiled code (C, C++ or Fortran) are put through a raft of additional checks to ensure that the compiled code will not cause R to crash. Once an issue is found, the package maintainer gets an email and usually a fairly short window to fix the package before it is removed from CRAN. However, replicating the error locally can require installation of all sorts of esoteric tools (and a copy of the development version of R from source) and it’s not always obvious how or where to start.

This blog post documents the process I used in clearing three issues from our dde package (which implements a simple solver for delay differential equations - the astute blog reader may recognise it from previous debugging efforts).

Package authors (or at least I) typically find out about these problems when getting an email from CRAN but they may have been live for some time and are listed in an “Additional issues” section on a package’s check page - however, this is shown only when there is a problem!

This blog post also serves as a place for me to find this information next time I need it and is written with the hope that it helps someone else with their debugging and package repairing chores. It was written while I debugged each problem, and is probably only of interest if you face a similar problem, in which case I hope the verbosity is useful.

There are three sections to this blog post - “Undefined behaviour”, “valgrind” and “rchk” corresponding to three of the possible additional checks. However, there are more types still, including testing with extremely new versions of gcc, alternative linear algebra implementations, and other diagnostic tools.

Undefined behaviour

The package had an issue where a unit test was doing something that was Undefined Behaviour. The information we get from CRAN to debug this error is:

> test_check("dde")
dopri.c:745:21: runtime error: -12.7138 is outside the range of representable values of type 'unsigned long'
    #0 0x7f030d7b378d in dopri_find_time /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri.c:745:21
    #1 0x7f030d7b387e in ylag_1 /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri.c:769:24
    #2 0x7f030d2c51e0 in ylag_1 /data/gannet/ripley/R/packages/tests-clang-SAN/dde.Rcheck/dde/include/dde/dde.c:15:10
    #3 0x7f030d2c51e0 in seir /tmp/RtmpNf9ol1/working_dir/RtmpgC6Usd/file82255939016e/seir.c:13:18
    #4 0x7f030d7afca4 in dopri_eval /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri.c:822:3
    #5 0x7f030d7b89ef in dopri5_step /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri_5.c:75:3
    #6 0x7f030d7abbfe in dopri_step /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri.c:253:5
    #7 0x7f030d7abbfe in dopri_integrate /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/dopri.c:377:5
    #8 0x7f030d7d6118 in r_dopri /data/gannet/ripley/R/packages/tests-clang-SAN/dde/src/r_dopri.c:176:3
    #9 0x6f93f9 in R_doDotCall /data/gannet/ripley/R/svn/R-devel/src/main/dotcode.c:744:17
    ...
    #276 0x52349e in do_lapply /data/gannet/ripley/R/svn/R-devel/src/main/apply.c:70:8

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dopri.c:745:21 in
══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 507 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]

(eliding just the uninformative part of the stack trace). So we know where in the C source the error comes from (dopri.c:745) but not the values that are passed to that code to trigger it. And to be sure that we’ve eliminated it we need to be able to replicate it locally.

To trigger this, I used the wch1/r-debug docker container as I’ve had success with this in the past. There are also containers from the rocker project that a similar approach will likely work with, such as rocker/r-devel-ubsan-clang.

First, I started the container with

docker run -v $PWD:/src:ro -it --rm --security-opt seccomp=unconfined wch1/r-debug

then, in that container, I prepared the R library using the R binary that was built with the undefined behaviour checkers enabled:

RDcsan -e 'install.packages(c("deSolve", "knitr", "ring", "microbenchmark", "rmarkdown", "testthat", "devtools", "roxygen2"))'
cp -r /src /dde

(Rdcsan is a version of R set up to use clang with the address and undefined behaviour sanitiser enabled - see the image readme). Running

RDcsan CMD INSTALL --preclean --install-tests /dde
RDcsan -e 'tools::testInstalledPackage("dde")'
cat dde-tests/testthat.Rout

showed the same output as CRAN reported - confirming I could replicate the error but still obscuring which test triggered it. In the end I ran the tests with the most verbose reporter:

RDcsan -e 'devtools::test("dde", reporter = testthat::LocationReporter)'

which produces

Start test: failure to fetch history
dopri.c:745:21: runtime error: -12.7138 is outside the range of representable values of type 'unsigned long'
    #0 0x7f472e7aa95a in dopri_find_time /dde/src/dopri.c:745:21
    #1 0x7f472e7ab097 in ylag_1 /dde/src/dopri.c:769:24
    #2 0x7f472cded83a in ylag_1 /dde/inst/include/dde/dde.c:15:10
...
    #250 0x7f474672e010 in getvar /tmp/r-source/src/main/eval.c:5128:14

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior dopri.c:745:21 in
  test-dde.R#269:1 [success]
End test: failure to fetch history

so we know where the error is triggered from! (Note that the UB checker only seems to report the error the first time it occurs in a session - it’s quite possible this is tuneable though.)

With a little surgery, a standalone script that reproduces the error is:

res <- dde:::shlib(system.file("examples/seir.c", package = "dde"), "dde_")
y0 <- c(1e7 - 1, 0, 1, 0)
times <- seq(0, 30, length.out = 301)
dde::dopri(y0, times, "seir", numeric(),
           atol = 1e-7, rtol = 1e-7, n_history = 2L,
           dllname = "dde_seir", return_history = FALSE)

and we can trigger the error in the instrumented copy of R by running RDcsan -f /src/bug.R in ~12s which is a bit less tedious than running the full suite. However, most of the debugging can be done in the plain copy. I compiled the package with optimisation turned off and debugging symbols enabled (CFLAGS = -g -O0), ran R with R -d gdb and created a breakpoint with break dopri.c:745 after loading the package.

The relevant bit of C looks like:

  size_t idx0 = 0;
  if (n > 0) {
    const double
      t0 = ((double*) ring_buffer_tail(obj->history))[idx_t],
      t1 = ((double*) ring_buffer_tail_offset(obj->history, n - 1))[idx_t];
    idx0 = min_size((t - t0) / (t1 - t0) / (n - 1), n - 1);
  }

This is an optimisation to seed a binary search for a value close to t in an array of values from t0 to t1 - we assume that the values between t0 and t1 are roughly evenly spaced and linearly interpolate between them to get a likely enough index for t, which we store as idx0.

When the undefined behaviour error is triggered (the second assignment to idx0), we have (approximately) t0 = 11, t1 = 12 and t = 0.02 which falls outside of the range of times, so the expression (t - t0) / (t1 - t0) / (n - 1) is negative and that’s the undefined behaviour because it falls outside of the valid values for a size_t (typically an unsigned long), which this expression is eventually cast as.

We can guard against this either by checking that t lies within (t0, t1) before doing the second assignment:

  size_t idx0 = 0;
  if (n > 0) {
    const double
      t0 = ((double*) ring_buffer_tail(obj->history))[idx_t],
      t1 = ((double*) ring_buffer_tail_offset(obj->history, n - 1))[idx_t];
    if ((t0 - t) * (t1 - t) < 0) {
      idx0 = (t - t0) / (t1 - t0) / (n - 1);
    }
  }

With this fix in place, the UBSAN checks pass without incident. See 8c384bb for details.

Valgrind

Valgrind finds memory errors and is one of my favourite tools for working out why something crashes. My (probably grossly oversimplified) understanding is that it runs a program with a layer that checks all memory accesses for correctness (alignment, bounds etc). Unsurprisingly this makes things very slow to run. The information on the CRAN page was:

==42439== Memcheck, a memory error detector
==42439== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==42439== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==42439== Command: /data/blackswan/ripley/R/R-devel-vg/bin/exec/R -f testthat.R --restore --save --no-readline --vanilla
==42439==

...

> library(testthat)
> library(dde)
>
> test_check("dde")
==42439== Invalid read of size 8
==42439==    at 0x48C5A50: dopri_data_reset (packages/tests-vg/dde/src/dopri.c:196)
==42439==    by 0x48C6906: dopri_integrate (packages/tests-vg/dde/src/dopri.c:294)
==42439==    by 0x48CB338: r_dopri (packages/tests-vg/dde/src/r_dopri.c:176)
==42439==    by 0x49B695: R_doDotCall (svn/R-devel/src/main/dotcode.c:744)
==42439==    by 0x49BFD4: do_dotcall (svn/R-devel/src/main/dotcode.c:1280)
==42439==    by 0x4D181C: bcEval (svn/R-devel/src/main/eval.c:7054)
==42439==    by 0x4E8197: Rf_eval (svn/R-devel/src/main/eval.c:688)
==42439==    by 0x4E9D56: R_execClosure (svn/R-devel/src/main/eval.c:1853)
==42439==    by 0x4EAB33: Rf_applyClosure (svn/R-devel/src/main/eval.c:1779)
==42439==    by 0x4E8363: Rf_eval (svn/R-devel/src/main/eval.c:811)
==42439==    by 0x4ECD01: do_set (svn/R-devel/src/main/eval.c:2920)
==42439==    by 0x4E85E4: Rf_eval (svn/R-devel/src/main/eval.c:763)
==42439==  Address 0x195135c0 is 1,760 bytes inside a block of size 7,960 alloc'd
==42439==    at 0x483880B: malloc (/builddir/build/BUILD/valgrind-3.15.0/coregrind/m_replacemalloc/vg_replace_malloc.c:309)
==42439==    by 0x5223E0: GetNewPage (svn/R-devel/src/main/memory.c:946)
==42439==    by 0x52418B: Rf_allocVector3 (svn/R-devel/src/main/memory.c:2784)
==42439==    by 0x4A4388: Rf_allocVector (svn/R-devel/src/include/Rinlinedfuns.h:593)
==42439==    by 0x4A4388: duplicate1 (svn/R-devel/src/main/duplicate.c:345)
==42439==    by 0x4E893F: EnsureLocal (svn/R-devel/src/main/eval.c:2048)
==42439==    by 0x4D2AE5: bcEval (svn/R-devel/src/main/eval.c:7146)
==42439==    by 0x4E8197: Rf_eval (svn/R-devel/src/main/eval.c:688)
==42439==    by 0x4E9D56: R_execClosure (svn/R-devel/src/main/eval.c:1853)
==42439==    by 0x4EAB33: Rf_applyClosure (svn/R-devel/src/main/eval.c:1779)
==42439==    by 0x4E8363: Rf_eval (svn/R-devel/src/main/eval.c:811)
==42439==    by 0x178CF77B: C_deriv_func (/tmp/RtmpTufyR0/R.INSTALLb7ae410bb7bc/deSolve/src/call_lsoda.c:127)
==42439==    by 0x179012CA: dstoda_ (/tmp/RtmpTufyR0/R.INSTALLb7ae410bb7bc/deSolve/src/opkda1.f:4200)
==42439==
══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 507 | SKIPPED: 0 | WARNINGS: 0 | FAILED: 0 ]

(eliding only the R startup and valgrind summary). Again, not a great deal to work with!

A copy of R instrumented with valgrind (improving debugging a bit) is also in the same docker image as above:

docker run -v $PWD:/src:ro -it --rm --security-opt seccomp=unconfined wch1/r-debug
RDvalgrind -e 'install.packages(c("deSolve", "knitr", "ring", "microbenchmark", "rmarkdown", "testthat", "devtools", "roxygen2"))'
cp -r /src /dde

Unfortunately, and surprisingly, running:

RDvalgrind CMD INSTALL --preclean --install-tests /dde
RDvalgrind -d valgrind -e 'tools::testInstalledPackage("dde")'
cat dde-tests/testthat.Rout

did not yield the error. But with a bit of poking based on the Command: line in the above valgrind output I got some success with:

(cd dde/tests && RDvalgrind -d valgrind -f testthat.R --no-readline --vanilla)

which shows the invalid read, alongside some less interesting output. Using the LocationReporter again was better still:

RDvalgrind -d valgrind -e 'devtools::test("dde", reporter = testthat::LocationReporter)'

showed

Start test: critical times
  test-ode.R#224:1 [success]
  test-ode.R#225:1 [success]
  test-ode.R#226:1 [success]
  test-ode.R#228:1 [success]
  test-ode.R#233:1 [success]
==5262== Invalid read of size 8
==5262==    at 0x180FD558: dopri_data_reset (dopri.c:196)
==5262==    by 0x180FD8CF: dopri_integrate (dopri.c:294)
==5262==    by 0x1810481B: r_dopri (r_dopri.c:176)
==5262==    by 0x4F30D5F: R_doDotCall (dotcode.c:744)
==5262==    by 0x4F3A8E5: do_dotcall (dotcode.c:1280)
==5262==    by 0x4F8AD8C: bcEval (eval.c:7054)
==5262==    by 0x4F7670A: Rf_eval (eval.c:688)
==5262==    by 0x4F79442: R_execClosure (eval.c:1853)
==5262==    by 0x4F790F6: Rf_applyClosure (eval.c:1779)
==5262==    by 0x4F76EF5: Rf_eval (eval.c:811)
==5262==    by 0x4F7D07C: do_set (eval.c:2920)
==5262==    by 0x4F76B5B: Rf_eval (eval.c:763)
==5262==  Address 0x17cde210 is 1,520 bytes inside a block of size 7,960 alloc'd
==5262==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5262==    by 0x4FCAE13: GetNewPage (memory.c:946)
==5262==    by 0x4FD9ED8: Rf_allocVector3 (memory.c:2784)
==5262==    by 0x4FBE3B9: Rf_allocVector (Rinlinedfuns.h:593)
==5262==    by 0x4EBD407: do_lapply (apply.c:46)
==5262==    by 0x4FE2024: do_internal (names.c:1379)
==5262==    by 0x4F8AFFB: bcEval (eval.c:7074)
==5262==    by 0x4F7670A: Rf_eval (eval.c:688)
==5262==    by 0x4F79442: R_execClosure (eval.c:1853)
==5262==    by 0x4F790F6: Rf_applyClosure (eval.c:1779)
==5262==    by 0x4F8A9D3: bcEval (eval.c:7022)
==5262==    by 0x4F7670A: Rf_eval (eval.c:688)
==5262==
  test-ode.R#238:1 [success]
  test-ode.R#243:1 [success]
End test: critical times

So we again know the location of the error and can reproduce it.

Taking the same approach as above, I boiled down the test into a minimal script that could be run easily to trigger the problems:

target <- function(t, y, p) {
  if (t <= 1) {
    y
  } else {
    -5 * y
  }
}
tt <- seq(0, 2, length.out = 200)
res <- dde::dopri(1, tt, target, numeric(0), tcrit = rep(tt[[1]], 3))

which could be run like

RDvalgrind -d valgrind -f /src/bug-valgrind.R

taking about 10s, which is fast as far as using valgrind goes. The code causing the problem was:

    double t0 = obj->sign * times[0];
    while (obj->sign * tcrit[obj->tcrit_idx] <= t0 &&
           obj->tcrit_idx < n_tcrit) {
      obj->tcrit_idx++;
    }

which is a nasty enough bit of book-keeping. But the problem is that the while condition’s two clauses are in the wrong order - when obj->tcrit_idx < n_tcrit is false we really should not be looking up tcrit[obj->tcrit_idx] because it is out of range, which is precisely what the “invalid read” error valgrind reported was. The fixed code looks like

    double t0 = obj->sign * times[0];
    while (obj->tcrit_idx < n_tcrit &&
           obj->sign * tcrit[obj->tcrit_idx] <= t0) {
      obj->tcrit_idx++;
    }

and is in 07f876a.

rchk

These are the results of static analysis tools (see details on CRAN). There are a lot of things checked by this tool, but one of the main ones - and the one behind all the problems below - is PROTECT/UNPROTECT errors, where the package author has failed to correctly protect memory from being reclaimed by R’s garbage collector. See this blog post by Tomas Kalibera for more background.

Thankfully, the information from CRAN is helpful by itself:

Package dde version 1.0.0
Package built using 77519/R 4.0.0; x86_64-pc-linux-gnu; 2019-12-05 05:52:35 UTC; unix
Checked with rchk version 490627e4fb8e93244230dbbd61455730aa43c328
More information at https://github.com/kalibera/cran-checks/blob/master/rchk/PROTECT.md

Suspicious call (two or more unprotected arguments) to Rf_setAttrib at r_difeq_cleanup dde/src/r_difeq.c:227
Suspicious call (two or more unprotected arguments) to Rf_setAttrib at r_dopri_cleanup dde/src/r_dopri.c:403
Suspicious call (two or more unprotected arguments) to Rf_setAttrib at r_dopri_cleanup dde/src/r_dopri.c:422

Function r_ylag
  [UP] unprotected variable r_y while calling allocating function r_indices dde/src/r_dopri.c:315
  [UP] unprotected variable r_y while calling allocating function ylag_vec dde/src/r_dopri.c:315

Function r_yprev
  [UP] unprotected variable r_y while calling allocating function r_indices dde/src/r_difeq.c:161

However, it is useful to recreate the errors locally so we can be sure that they are actually fixed.

To debug this I used a container created by the amazing r-hub project, rhub/ubuntu-rchk

docker run --rm -it -v $PWD:/src:ro -v ~/.Rprofile:/home/docker/.Rprofile -w /home/docker rhub/ubuntu-rchk

Inside the container I ran:

R -e 'install.packages(c("deSolve", "ring"))'
cp -r /src dde
rchk.sh dde

which produces the same errors as observed in the report.

These changes all turned out to be fairly straightforward and unambiguous errors. For example the code

    r_y = PROTECT(allocVector(REALSXP, ni));
    if (ni == 1) {
      REAL(r_y)[0] = yprev_1(step, r_index(r_idx, n));
    } else {
      r_y = allocVector(REALSXP, ni);
      yprev_vec(step, r_indices(r_idx, n), ni, REAL(r_y));
    }

produced the warning

[UP] unprotected variable r_y while calling allocating function r_indices dde/src/r_difeq.c:161

It turns out the second r_y was vestigial and should be deleted. It was dangerous because it was not protected and the function r_indices did an allocation so the memory underlying this second r_y could have been reclaimed.

Similarly, the call

    setAttrib(history, install("n"), ScalarInteger(obj->n));

produced the warning

Suspicious call (two or more unprotected arguments) to Rf_setAttrib at r_difeq_cleanup dde/src/r_difeq.c:227

because the SEXP produced by ScalarInteger(obj->n) could have been reclaimed when the install is run (and these could be evaluated in any order the compiler fancies). This was replaced with

    SEXP r_n = PROTECT(ScalarInteger(obj->n));
    setAttrib(history, install("n"), r_n);

and an additional UNPROTECT. See commit 72b7e60 for the full details.

Conclusions

This was all a bit tedious, but these were all errors in the package that could have resulted in crashes that would have been much more tedious to debug, especially as they would likely have turned up essentially non-determinstically as R’s garbage collector was triggered. It took longer to replicate all these errors locally than to fix them, but doing this was worthwhile because it removed the guesswork as to whether they were actually fixed.




comments powered by Disqus