Description
If you force the cgo resolver on non-Debian-based Linux with glibc <=2.25 (that is, any stable glibc as of this writing) and Go 1.8.3 (probably any released version), it's very easy to get into a state where all network connections fail even though the machine's network is up:
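    package main

    import (
    	"fmt"
    	"net"
    	"time"
    )

    func main() {
    	// Resolve the same name once per second and print each result,
    	// so that failures and recoveries are easy to spot in the output.
    	i := 1
    	for {
    		ip, err := net.LookupIP("google.com")
    		if err != nil {
    			fmt.Printf("%3d error: %s\n", i, err)
    		} else {
    			fmt.Printf("%3d success: %s\n", i, ip)
    		}
    		time.Sleep(1 * time.Second)
    		i++
    	}
    }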
- Turn off / unplug the network connection on your machine.
- Observe that NetworkManager or whatever daemon on your machine updates /etc/resolv.conf and removes the nameservers. (This does depend on what Linux distro / local network configuration you have, so if /etc/resolv.conf never changes you might need to repro this on a different machine.)
- Run GODEBUG=netdns=cgo go run test.go (or whatever you named it) on the file above. The GODEBUG setting forces the net library to use the cgo resolver. Note that the pure-Go resolver does not exhibit this bug.
- Watch it for a few seconds and observe a couple lookup errors (no such host). This is expected, because the network is down.
- Without stopping the test program above, turn the network back on.
- Confirm again that /etc/resolv.conf changes, and that your nameservers are back.
- Keep watching the output of the test program.
BUG: The test program's DNS requests keep failing. Sometimes they recover eventually, though I'm not sure under exactly what conditions, and sometimes I never see them recover. Recovery seems less likely if the test program ran with the network off for a minute than if the network came back after a couple of seconds.
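Since the pure-Go resolver doesn't show the problem, one way to sidestep it in application code is to ask for that resolver explicitly. Here's a minimal sketch, assuming a Go version where net.Resolver has the PreferGo field; the helper name newPureGoResolver is made up for illustration, and I haven't checked how this interacts with every GODEBUG setting:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver is a hypothetical helper. It returns a resolver that
// asks the net package to prefer its built-in Go DNS client over the
// cgo (glibc) resolver, scoped to just this resolver value.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a dead network doesn't block for long.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupIPAddr(ctx, "google.com")
	if err != nil {
		fmt.Println("pure-Go lookup error:", err)
		return
	}
	fmt.Println("pure-Go lookup success:", addrs)
}
```

This only helps code that goes through this resolver value. Anything that calls net.LookupIP or dials through the default resolver still takes whatever path GODEBUG and the platform select.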
This seems to be related to a bug in glibc, where /etc/resolv.conf isn't checked for changes as often as it should be. It looks like glibc is going to ship a fix for this in their next release, though affected versions will still be around for years. Debian has patched this fix themselves for a long time (this patch?), which is why I don't think the bug will repro on Debian-based distros. I've recently had to work around this issue in application code, by calling libc::res_init manually. The Rust standard library has started calling res_init after DNS failures by default, and that issue links to similar workarounds in other languages and applications.
Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).
I think the Go standard library should consider adding a workaround similar to what Rust has done. It looks like this was considered in #10850, but it might not have been clear how common this bug is? The most painful case is a long-running Go process on a laptop where the WiFi comes and goes.
[Apologies if the "proposal" label was incorrect. This could be called a bugfix, depending on how you feel about it :) ]