Using Automatic Differentiation
Analytically finding and then implementing the Jacobian matrix of the mean function for complex models can be time-consuming. It is possible to use your favorite Automatic Differentiation package for this task. This vignette shows how to use ForwardDiff.jl for the Sigmoid Emax model from the tutorial. Wrapping other autodiff packages should work similarly.
We start by comparing a naive autodiff version to manual gradients, and then show a slightly hacky but faster version.
Naive Implementation
using Kirstine, Random, Plots, ForwardDiff
using BenchmarkTools: @benchmark
@simple_model SigEmax dose

Since ForwardDiff can only compute the derivative with respect to a Vector argument, we define a Kirstine.Parameter subtype that wraps a Vector.
struct SigEmaxVectorParameter <: Kirstine.Parameter
θ::Vector{Float64} # θ = (e0, emax, ed50, h)
end
Kirstine.dimension(p::SigEmaxVectorParameter) = 4
# We need to prevent NaN in the automatic gradient for x == 0
mu(x, θ) = x == 0 ? θ[1] : θ[1] + θ[2] / ((x / θ[3])^(-θ[4]) + 1)
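To see why the guard is needed, you can evaluate the automatic gradient of the unguarded formula at dose 0. The helper mu_unguarded below is purely illustrative: at x == 0 the propagated dual numbers involve 0 * Inf and log(0) terms, so the result is expected to contain NaN.

# illustration only: without the x == 0 guard, dual-number arithmetic hits
# 0 * Inf and log(0), and NaN shows up in the gradient at dose 0
mu_unguarded(x, θ) = θ[1] + θ[2] / ((x / θ[3])^(-θ[4]) + 1)
ForwardDiff.gradient(θ -> mu_unguarded(0.0, θ), [1.0, 2.0, 0.4, 5.0])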
function jacobianmatrix_auto!(
jm,
m::SigEmaxModel,
c::SigEmaxCovariate,
p::SigEmaxVectorParameter,
)
f(θ) = mu(c.dose, θ)
return ForwardDiff.gradient!(jm, f, p.θ)
end
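As noted in the introduction, other autodiff packages can be wrapped in the same way. The following sketch shows what this could look like with Zygote.jl; it assumes `using Zygote`, which is not loaded in this vignette, and the name jacobianmatrix_zygote! is only illustrative. For a model with just four parameters, a reverse-mode package like Zygote typically has more overhead than forward mode, so the sketch is meant to show the wrapping pattern rather than a recommendation.

# hypothetical sketch, assuming Zygote is loaded
function jacobianmatrix_zygote!(
jm,
m::SigEmaxModel,
c::SigEmaxCovariate,
p::SigEmaxVectorParameter,
)
f(θ) = mu(c.dose, θ)
# Zygote.gradient returns a tuple with one gradient per argument
jm[1, :] .= Zygote.gradient(f, p.θ)[1]
return jm
end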
function jacobianmatrix_manual!(
jm,
m::SigEmaxModel,
c::SigEmaxCovariate,
p::SigEmaxVectorParameter,
)
dose_pow_h = c.dose^p.θ[4]
ed50_pow_h = p.θ[3]^p.θ[4]
A = dose_pow_h / (dose_pow_h + ed50_pow_h)
B = ed50_pow_h * p.θ[2] / (dose_pow_h + ed50_pow_h)
jm[1, 1] = 1.0
jm[1, 2] = A
jm[1, 3] = -A * B * p.θ[4] / p.θ[3]
jm[1, 4] = c.dose == 0 ? 0.0 : A * B * log(c.dose / p.θ[3])
return jm
end

We set up an example design problem.
theta_mean = [1, 2, 0.4, 5]
theta_sd = [0.5, 0.5, 0.05, 0.5]
Random.seed!(31415)
sep_draws = map(1:1000) do i
rnd = theta_mean .+ theta_sd .* randn(4)
return SigEmaxVectorParameter(rnd)
end
prior_sample = PriorSample(sep_draws)
dp1 = DesignProblem(
criterion = DCriterion(),
region = DesignInterval(:dose => (0, 1)),
model = SigEmaxModel(sigma = 1),
covariate_parameterization = CopyTo(:dose),
prior_knowledge = prior_sample,
)

Let's first compare the time it takes to evaluate the Jacobian matrices once.
jm = zeros(1, 4)
co = SigEmaxCovariate(0.5)
bm1 = @benchmark jacobianmatrix_auto!($jm, $model(dp1), $co, $sep_draws[1])

BenchmarkTools.Trial: 10000 samples with 315 evaluations per sample.
Range (min … max): 272.981 ns … 16.210 μs ┊ GC (min … max): 0.00% … 97.19%
Time (median): 299.530 ns ┊ GC (median): 0.00%
Time (mean ± σ): 334.803 ns ± 326.307 ns ┊ GC (mean ± σ): 7.54% ± 8.24%
▁▁▅█▆▄▃▁▂▁ ▁▂▂▃▂ ▂
█████████████████▆▆▅▅▆▆▆▅▅▄▄▃▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄ █
273 ns Histogram: log(frequency) by time 734 ns <
Memory estimate: 416 bytes, allocs estimate: 6.

bm2 = @benchmark jacobianmatrix_manual!($jm, $model(dp1), $co, $sep_draws[1])

BenchmarkTools.Trial: 10000 samples with 984 evaluations per sample.
Range (min … max): 55.605 ns … 3.959 μs ┊ GC (min … max): 0.00% … 97.80%
Time (median): 65.280 ns ┊ GC (median): 0.00%
Time (mean ± σ): 66.131 ns ± 39.477 ns ┊ GC (mean ± σ): 0.59% ± 0.98%
█▁ ▄▃
▂▂▁▁▂▂▃▆▃██▃▃▄▇▆██▇▅▃▂▄▅▄▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
55.6 ns Histogram: frequency by time 90.9 ns <
Memory estimate: 16 bytes, allocs estimate: 1.

We clearly see that the automatic gradient is about four to five times slower than the manual one, and that it allocates additional memory.
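If you prefer a number over reading the histograms, BenchmarkTools can compare the two trials directly. This optional snippet reuses the bm1 and bm2 objects from above and compares the fastest run of each trial.

using BenchmarkTools: ratio, judge
# relative time and memory of the automatic versus the manual gradient
ratio(minimum(bm1), minimum(bm2))
# classifies the difference as a regression, an improvement, or invariant
judge(minimum(bm1), minimum(bm2))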
Next, we compare the time it takes to run the Particle Swarm optimizer for a given number of iterations. In these examples, the number of iterations and the swarm size are smaller than needed for full convergence in order to keep documentation compile time low. But since runtime scales linearly in both, this is no obstacle for timing comparisons.
# force compilation before time measurement
dummy = DirectMaximization(
optimizer = Pso(iterations = 2, swarmsize = 2),
prototype = equidistant_design(region(dp1), 6),
)
strategy = DirectMaximization(
optimizer = Pso(iterations = 20, swarmsize = 50),
prototype = equidistant_design(region(dp1), 6),
)
Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_auto!(jm, m, c, p)
solve(dp1, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s1, r1 = solve(dp1, strategy; maxweight = 1e-3, maxdist = 1e-2)
nothing # hide (force inserting stdout from @time)

3.305334 seconds (30.44 M allocations: 2.257 GiB, 6.76% gc time, 0.21% compilation time)

Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_manual!(jm, m, c, p)
solve(dp1, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s2, r2 = solve(dp1, strategy; maxweight = 1e-3, maxdist = 1e-2)

1.293200 seconds (356.60 k allocations: 16.071 MiB, 0.56% gc time)

Here we see that solving the design problem with automatic gradients takes about two and a half times as long as with the manual gradient, and that it needs roughly two orders of magnitude more memory and allocations. In the next section we look at a way to speed things up by reducing the number of allocations. But first let's check that both candidate solutions are equal.
s1 == s2

true

Faster Implementation
One way to reduce the number of allocations is to pass a ForwardDiff.GradientConfig object to ForwardDiff.gradient!. However, this will not work if the closure θ -> mu(c.dose, θ) changes in every call to jacobianmatrix!. The following is a slightly hacky solution to this problem which misuses the model type.
- Add a GradientConfig field to the model type.
- Add a SigEmaxCovariate field to the model type.
- In the model constructor, set up a closure over this covariate, and save the closure in a third additional field.
- In jacobianmatrix!, modify the closed-over covariate and then call gradient! with the saved closure.
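If you have not used GradientConfig before, here is a minimal standalone sketch of the underlying idea, separate from the model type. It reuses mu, jm, and sep_draws from above; the fixed dose of 0.5 and the name f_fixed are only illustrative. A config created for one closure can be reused for every gradient call with that same closure.

# illustrative only: preallocate a config once and reuse it in later calls
f_fixed = θ -> mu(0.5, θ)
cfg_fixed = ForwardDiff.GradientConfig(f_fixed, zeros(4), ForwardDiff.Chunk{4}())
ForwardDiff.gradient!(jm, f_fixed, sep_draws[1].θ, cfg_fixed)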
struct HackySigEmaxModel{T<:Function} <: Kirstine.NonlinearRegression
sigma::Float64
cfg::ForwardDiff.GradientConfig
c::SigEmaxCovariate
mu_closure::T
function HackySigEmaxModel(; sigma, pardim, mu)
c = SigEmaxCovariate(0)
mu_closure = θ -> mu(c, θ)
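# the config is created once per model instance; it stays valid because mu_closure itself never changes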
cfg = ForwardDiff.GradientConfig(
mu_closure,
zeros(1, pardim),
ForwardDiff.Chunk{pardim}(),
)
new{typeof(mu_closure)}(sigma, cfg, c, mu_closure)
end
end
Kirstine.allocate_covariate(m::HackySigEmaxModel) = SigEmaxCovariate(0)
Kirstine.unit_length(m::HackySigEmaxModel) = 1
function Kirstine.update_model_vcov!(Sigma, m::HackySigEmaxModel, c::SigEmaxCovariate)
Sigma[1, 1] = m.sigma^2
return Sigma
end
function Kirstine.jacobianmatrix!(
jm,
m::HackySigEmaxModel,
c::SigEmaxCovariate,
p::SigEmaxVectorParameter,
)
m.c.dose = c.dose # modify the closed-over covariate
return ForwardDiff.gradient!(jm, m.mu_closure, p.θ, m.cfg)
end
# Note: since neither argument is type-annotated, this definition replaces the
# earlier mu(x, θ) method from the naive implementation.
mu(c, θ) = c.dose == 0 ? θ[1] : θ[1] + θ[2] / ((c.dose / θ[3])^(-θ[4]) + 1)
dph = DesignProblem(dp1; model = HackySigEmaxModel(sigma = 1, pardim = 4, mu = mu))
bm3 = @benchmark Kirstine.jacobianmatrix!($jm, $model(dph), $co, $sep_draws[1])

BenchmarkTools.Trial: 10000 samples with 967 evaluations per sample.
Range (min … max): 82.323 ns … 8.597 μs ┊ GC (min … max): 0.00% … 98.49%
Time (median): 91.554 ns ┊ GC (median): 0.00%
Time (mean ± σ): 99.534 ns ± 137.617 ns ┊ GC (mean ± σ): 3.93% ± 3.05%
█ ▁▃
▁▃▂▂▂▂▂▄████▃▃▂▂▂▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
82.3 ns Histogram: frequency by time 134 ns <
Memory estimate: 80 bytes, allocs estimate: 3.

This time, the automatic gradient is only about 1.5 times as slow as the manual one.
solve(dph, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s3, r3 = solve(dph, strategy; maxweight = 1e-3, maxdist = 1e-2)

1.412013 seconds (6.37 M allocations: 107.883 MiB, 0.85% gc time)

Solving the design problem now takes only about 10% more time than with the manual gradient. The number of allocations is still much higher, but compared to the naive implementation, the allocations are reduced by a factor of about five and the cumulative memory usage by a factor of about twenty.
The candidate solution is still the same.
s3 == s1

true

Transformations
A place where autodiff does not incur a noticeable performance hit is the computation of a Kirstine.Transformation's Jacobian matrix at the parameter values in the prior sample, since these are computed only once for a design problem.
To illustrate this, let's try to find a design that is optimal for estimating $\mathrm{ED}_{ρ}$ for some fixed $0 < ρ < 100$, i.e. the concentration such that
\[\MeanFunction(\mathrm{ED}_{ρ},\Parameter) = E_0 + \frac{ρ}{100} E_{\max{}}.\]
While it is relatively easy to find that
\[\mathrm{ED}_{ρ} = \Transformation(\Parameter) := \frac{\mathrm{ED}_{50}}{(100/ρ - 1)^{1/h}},\]
figuring out the partial derivatives of $\Transformation$ is somewhat tedious. The automatic version, in contrast, only takes one line of code.
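For reference, the closed form above follows from the defining equation by plugging in the sigmoid Emax mean function and solving for the dose:

\[E_0 + \frac{E_{\max{}}}{(x/\mathrm{ED}_{50})^{-h} + 1} = E_0 + \frac{ρ}{100} E_{\max{}}
\iff (x/\mathrm{ED}_{50})^{-h} = \frac{100}{ρ} - 1
\iff x = \frac{\mathrm{ED}_{50}}{(100/ρ - 1)^{1/h}}.\]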
# switch back to faster manual jacobian matrix of the mean function
Kirstine.jacobianmatrix!(jm, m, c, p) = jacobianmatrix_manual!(jm, m, c, p)
ED(θ, rho) = [θ[3] / (100 / rho - 1)^(1 / θ[4])]
function DED_manual(p, rho)
del_e0 = 0
del_emax = 0
a = (100 / rho - 1)^(1 / p.θ[4])
del_ed50 = 1 / a
del_h = p.θ[3] * log(100 / rho - 1) / (p.θ[4]^2 * a)
return [del_e0 del_emax del_ed50 del_h]
end
function DED_auto(p, rho)
return ForwardDiff.jacobian(θ -> ED(θ, rho), p.θ)
end
# quickly check for implementation error
DED_manual(parameters(prior_sample)[1], 10) .- DED_auto(parameters(prior_sample)[1], 10)

1×4 Matrix{Float64}:
0.0  0.0  0.0  -6.93889e-18

# optimal for ED_{90}, using DED_auto
dp2 = DesignProblem(dp1; transformation = DeltaMethod(p -> DED_auto(p, 90)))
solve(dp2, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s3, r3 = solve(dp2, strategy; maxweight = 1e-3, maxdist = 1e-2)

1.651924 seconds (413.43 k allocations: 20.712 MiB)

# optimal for ED_{90}, using DED_manual
dp3 = DesignProblem(dp1; transformation = DeltaMethod(p -> DED_manual(p, 90)))
solve(dp3, dummy; maxweight = 1e-3, maxdist = 1e-2)
Random.seed!(54321)
@time s4, r4 = solve(dp3, strategy; maxweight = 1e-3, maxdist = 1e-2)

1.627623 seconds (410.41 k allocations: 19.932 MiB)

As we see, both versions take practically the same amount of time.
s3 == s4

true