Using Automatic Differentiation

Analytically finding and then implementing the Jacobian matrix of the mean function for complex models can be time-consuming. It is possible to use your favorite Automatic Differentiation package for this task. This vignette shows how to use ForwardDiff.jl for the Sigmoid Emax model from the tutorial. Wrapping other autodiff packages should work similarly.

We start by comparing a naive autodiff version to manual gradients, and then show a slightly hacky but faster version.

Naive Implementation

using Kirstine, Random, Plots, ForwardDiff
using BenchmarkTools: @benchmark

@simple_model SigEmax dose

Since ForwardDiff can only compute the derivative with respect to a Vector argument, we define a Kirstine.Parameter subtype that wraps a Vector.

struct SigEmaxVectorParameter <: Kirstine.Parameter
    θ::Vector{Float64} # θ = (e0, emax, ed50, h)
end

Kirstine.dimension(p::SigEmaxVectorParameter) = 4

# We need to prevent NaN in the automatic gradient for x == 0
mu(x, θ) = x == 0 ? θ[1] : θ[1] + θ[2] / ((x / θ[3])^(-θ[4]) + 1)

function jacobianmatrix_auto!(
    jm,
    m::SigEmaxModel,
    c::SigEmaxCovariate,
    p::SigEmaxVectorParameter,
)
    f(θ) = mu(c.dose, θ)
    return ForwardDiff.gradient!(jm, f, p.θ)
end

function jacobianmatrix_manual!(
    jm,
    m::SigEmaxModel,
    c::SigEmaxCovariate,
    p::SigEmaxVectorParameter,
)
    dose_pow_h = c.dose^p.θ[4]
    ed50_pow_h = p.θ[3]^p.θ[4]
    A = dose_pow_h / (dose_pow_h + ed50_pow_h)
    B = ed50_pow_h * p.θ[2] / (dose_pow_h + ed50_pow_h)
    jm[1, 1] = 1.0
    jm[1, 2] = A
    jm[1, 3] = -A * B * p.θ[4] / p.θ[3]
    jm[1, 4] = c.dose == 0 ? 0.0 : A * B * log(c.dose / p.θ[3])
    return jm
end

We set up an example design problem.

theta_mean = [1, 2, 0.4, 5]
theta_sd = [0.5, 0.5, 0.05, 0.5]
Random.seed!(31415)
sep_draws = map(1:1000) do i
    rnd = theta_mean .+ theta_sd .* randn(4)
    return SigEmaxVectorParameter(rnd)
end
prior_sample = PriorSample(sep_draws)

dp1 = DesignProblem(
    criterion = DCriterion(),
    region = DesignInterval(:dose => (0, 1)),
    model = SigEmaxModel(sigma = 1),
    covariate_parameterization = CopyTo(:dose),
    prior_knowledge = prior_sample,
)

Let's first compare the time it takes to evaluate the Jacobian matrices once.

jm = zeros(1, 4)
co = SigEmaxCovariate(0.5)
bm1 = @benchmark jacobianmatrix_auto!($jm, $model(dp1), $co, $sep_draws[1])
BenchmarkTools.Trial: 10000 samples with 315 evaluations per sample.
 Range (min … max):  272.981 ns … 16.210 μs   GC (min … max): 0.00% … 97.19%
 Time  (median):     299.530 ns                GC (median):    0.00%
 Time  (mean ± σ):   334.803 ns ± 326.307 ns   GC (mean ± σ):  7.54% ±  8.24%

  ▁▁▅█▄▃▁▁ ▁▂▂▃▂                                              ▂
  ███████████████▆▆▅▅▆▆▆▅▅▄▄▃▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄ █
  273 ns        Histogram: log(frequency) by time        734 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.
bm2 = @benchmark jacobianmatrix_manual!($jm, $model(dp1), $co, $sep_draws[1])
BenchmarkTools.Trial: 10000 samples with 984 evaluations per sample.
 Range (min … max):  55.605 ns … 3.959 μs   GC (min … max): 0.00% … 97.80%
 Time  (median):     65.280 ns               GC (median):    0.00%
 Time  (mean ± σ):   66.131 ns ± 39.477 ns   GC (mean ± σ):  0.59% ±  0.98%

           █▁                                                
  ▂▂▁▁▂▂▃▆▃██▃▃▄▇▆▅▃▂▄▅▄▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  55.6 ns         Histogram: frequency by time        90.9 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

We clearly see that the automatic gradient is roughly four to five times slower, and that it also allocates additional memory.

Next, we compare the time it takes to run the Particle Swarm optimizer for a given number of iterations. In these examples, the iterations and swarmsize are smaller than needed for full convergence in order to keep the documentation build time low. But since runtime scales linearly in both, this is no obstacle for timing comparisons.

# force compilation before time measurement
dummy = DirectMaximization(
    optimizer = Pso(iterations = 2, swarmsize = 2),
    prototype = equidistant_design(region(dp1), 6),
)

strategy = DirectMaximization(
    optimizer = Pso(iterations = 20, swarmsize = 50),
    prototype = equidistant_design(region(dp1), 6),
)

Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_auto!(jm, m, c, p)
solve(dp1, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s1, r1 = solve(dp1, strategy; maxweight = 1e-3, maxdist = 1e-2)
nothing # hide (force inserting stdout from @time)
  3.305334 seconds (30.44 M allocations: 2.257 GiB, 6.76% gc time, 0.21% compilation time)
Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_manual!(jm, m, c, p)
solve(dp1, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s2, r2 = solve(dp1, strategy; maxweight = 1e-3, maxdist = 1e-2)
  1.293200 seconds (356.60 k allocations: 16.071 MiB, 0.56% gc time)

Here we see that solving the design problem with automatic gradients takes about two and a half times as long as with the manual gradient, and that it needs about two orders of magnitude more memory and allocations. In the next section we look at a way to speed things up by reducing the number of allocations. But first let's check that both candidate solutions are equal.

s1 == s2
true

Faster Implementation

One way to reduce the number of allocations is to pass a ForwardDiff.GradientConfig object to ForwardDiff.gradient! (a minimal sketch of this plain pattern is shown after the list below). However, this does not help if a new closure θ -> mu(c.dose, θ) is constructed in every call to jacobianmatrix!. The following is a slightly hacky solution to this problem, which misuses the model type.

  1. Add a GradientConfig field to the model type.
  2. Add a SigEmaxCovariate field to the model type.
  3. In the model constructor, set up a closure over this covariate, and save the closure in a third additional field.
  4. In jacobianmatrix!, modify the closed-over covariate and then call gradient! with the saved closure.
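
For reference, here is a minimal, self-contained sketch of the plain GradientConfig pattern for a fixed function. The function f and the parameter values below are purely illustrative and not part of Kirstine.

using ForwardDiff

# an illustrative fixed function of the parameter vector
f(θ) = θ[1] + θ[2] / ((0.5 / θ[3])^(-θ[4]) + 1)

theta0 = [1.0, 2.0, 0.4, 5.0]
g = zeros(4)

# the config preallocates ForwardDiff's internal dual-number buffers for f;
# reusing it across repeated gradient! calls avoids allocating them each time
cfg = ForwardDiff.GradientConfig(f, theta0, ForwardDiff.Chunk{4}())
ForwardDiff.gradient!(g, f, theta0, cfg)

The hacky model type defined next reuses a single config and a single closure across all calls.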
struct HackySigEmaxModel{T<:Function} <: Kirstine.NonlinearRegression
    sigma::Float64
    cfg::ForwardDiff.GradientConfig
    c::SigEmaxCovariate
    mu_closure::T
    function HackySigEmaxModel(; sigma, pardim, mu)
        c = SigEmaxCovariate(0)
        mu_closure = θ -> mu(c, θ)
        cfg = ForwardDiff.GradientConfig(
            mu_closure,
            zeros(1, pardim),
            ForwardDiff.Chunk{pardim}(),
        )
        new{typeof(mu_closure)}(sigma, cfg, c, mu_closure)
    end
end

Kirstine.allocate_covariate(m::HackySigEmaxModel) = SigEmaxCovariate(0)
Kirstine.unit_length(m::HackySigEmaxModel) = 1
function Kirstine.update_model_vcov!(Sigma, m::HackySigEmaxModel, c::SigEmaxCovariate)
    Sigma[1, 1] = m.sigma^2
    return Sigma
end

function Kirstine.jacobianmatrix!(
    jm,
    m::HackySigEmaxModel,
    c::SigEmaxCovariate,
    p::SigEmaxVectorParameter,
)
    m.c.dose = c.dose # modify the closed-over covariate
    return ForwardDiff.gradient!(jm, m.mu_closure, p.θ, m.cfg)
end

# note: both arguments are untyped, so this definition replaces the earlier mu(x, θ) method
mu(c, θ) = c.dose == 0 ? θ[1] : θ[1] + θ[2] / ((c.dose / θ[3])^(-θ[4]) + 1)
dph = DesignProblem(dp1; model = HackySigEmaxModel(sigma = 1, pardim = 4, mu = mu))

bm3 = @benchmark Kirstine.jacobianmatrix!($jm, $model(dph), $co, $sep_draws[1])
BenchmarkTools.Trial: 10000 samples with 967 evaluations per sample.
 Range (min … max):  82.323 ns … 8.597 μs   GC (min … max): 0.00% … 98.49%
 Time  (median):     91.554 ns                GC (median):    0.00%
 Time  (mean ± σ):   99.534 ns ± 137.617 ns   GC (mean ± σ):  3.93% ±  3.05%

          █ ▁                                                  
  ▁▃▂▂▂▂▂▄███▃▃▂▂▂▁▁▁▁▁▁▁▁▂▂▂▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  82.3 ns         Histogram: frequency by time          134 ns <

 Memory estimate: 80 bytes, allocs estimate: 3.

This time, the automatic gradient is only about 40 to 50 percent slower than the manual one.

solve(dph, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s3, r3 = solve(dph, strategy; maxweight = 1e-3, maxdist = 1e-2)
  1.412013 seconds (6.37 M allocations: 107.883 MiB, 0.85% gc time)

Solving the design problem now takes only about 10% more time than with the manual gradient. The number of allocations is still much higher, but compared to the naive implementation it is reduced by a factor of about five, and cumulative memory usage by a factor of about twenty.

The candidate solution is still the same.

s3 == s1
true

Transformations

A place where autodiff does not incur a noticeable performance hit is the computation of a Kirstine.Transformation's Jacobian matrix at the parameter values in the prior sample, since these are computed only once for a design problem.

To illustrate this, let's try to find a design that is optimal for estimating $\mathrm{ED}_{ρ}$ for some fixed $0 < ρ < 100$, i.e. the dose such that

\[\MeanFunction(\mathrm{ED}_{ρ},\Parameter) = E_0 + \frac{ρ}{100} E_{\max{}}.\]

While it is relatively easy to find that

\[\mathrm{ED}_{ρ} = \Transformation(\Parameter) := \frac{\mathrm{ED}_{50}}{(100/ρ - 1)^{1/h}},\]

figuring out the partial derivatives of $\Transformation$ is somewhat tedious. The automatic version, in contrast, takes only one line of code.
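
For reference, the expression for $\mathrm{ED}_{ρ}$ above follows from setting the mean response equal to $E_0 + \frac{ρ}{100} E_{\max{}}$ and solving for the dose; with the sigmoid Emax parameterization from the tutorial this is just

\[\frac{\mathrm{ED}_{ρ}^{h}}{\mathrm{ED}_{ρ}^{h} + \mathrm{ED}_{50}^{h}} = \frac{ρ}{100}
\iff
\Bigl(\frac{\mathrm{ED}_{50}}{\mathrm{ED}_{ρ}}\Bigr)^{h} = \frac{100}{ρ} - 1
\iff
\mathrm{ED}_{ρ} = \frac{\mathrm{ED}_{50}}{(100/ρ - 1)^{1/h}}.\]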

# switch back to faster manual jacobian matrix of the mean function
Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_manual!(jm, m, c, p)

ED(θ, rho) = [θ[3] / (100 / rho - 1)^(1 / θ[4])]

function DED_manual(p, rho)
    del_e0 = 0
    del_emax = 0
    a = (100 / rho - 1)^(1 / p.θ[4])
    del_ed50 = 1 / a
    del_h = p.θ[3] * log(100 / rho - 1) / (p.θ[4]^2 * a)
    return [del_e0 del_emax del_ed50 del_h]
end

function DED_auto(p, rho)
    return ForwardDiff.jacobian(θ -> ED(θ, rho), p.θ)
end

# quickly check for implementation error
DED_manual(parameters(prior_sample)[1], 10) .- DED_auto(parameters(prior_sample)[1], 10)
1×4 Matrix{Float64}:
 0.0  0.0  0.0  -6.93889e-18
# optimal for estimating ED_{90}, using the automatic Jacobian
dp2 = DesignProblem(dp1; transformation = DeltaMethod(p -> DED_auto(p, 90)))

solve(dp2, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s3, r3 = solve(dp2, strategy; maxweight = 1e-3, maxdist = 1e-2)
  1.651924 seconds (413.43 k allocations: 20.712 MiB)
# the same problem, using the manually derived Jacobian
dp3 = DesignProblem(dp1; transformation = DeltaMethod(p -> DED_manual(p, 90)))
solve(dp3, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s4, r4 = solve(dp3, strategy; maxweight = 1e-3, maxdist = 1e-2)
  1.627623 seconds (410.41 k allocations: 19.932 MiB)

As we see, both versions take practically the same amount of time.

s3 == s4
true