net: resolver should call res_init when resolv.conf changes #21083

Open
@oconnor663

Description

If you force the cgo resolver on non-Debian-based Linux with glibc <=2.25 (that is, any stable glibc as of this writing) and Go 1.8.3 (probably any released version), it's very easy to get into a state where all network connections fail even though the machine's network is up:

package main

import (
	"fmt"
	"net"
	"time"
)

// Look up google.com once per second, forever, and print a numbered
// success or error line for each attempt.
func main() {
	i := 1
	for {
		ip, err := net.LookupIP("google.com")
		if err != nil {
			fmt.Printf("%3d error: %s\n", i, err)
		} else {
			fmt.Printf("%3d success: %s\n", i, ip)
		}
		time.Sleep(1 * time.Second)
		i++
	}
}

To reproduce:

  1. Turn off or unplug the network connection on your machine.
  2. Observe that NetworkManager (or whichever daemon manages networking on your machine) updates /etc/resolv.conf and removes the nameservers. This depends on your Linux distro and local network configuration, so if /etc/resolv.conf never changes you might need to repro this on a different machine.
  3. Run GODEBUG=netdns=cgo go run test.go on the file above (substitute whatever you named it). The GODEBUG setting forces the net library to use the cgo resolver. Note that the pure-Go resolver does not exhibit this bug.
  4. Watch it for a few seconds and observe a couple of lookup errors (no such host). This is expected, because the network is down.
  5. Without stopping the test program, turn the network back on.
  6. Confirm that /etc/resolv.conf changes again and that your nameservers are back.
  7. Keep watching the output of the test program.
    BUG: The test program's DNS requests keep failing. Sometimes they eventually recover, though I'm not sure under exactly what conditions, and sometimes I never see them recover. Recovery seems less likely if the test program ran with the network off for a minute than if the network came back after a couple of seconds.
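
Since step 3 notes that the pure-Go resolver does not exhibit the bug, one possible application-level way to sidestep it is to ask for that resolver explicitly. The following is only a sketch: the helper name newPureGoResolver is made up for illustration, and the PreferGo field of net.Resolver was added in Go 1.9, so this does not apply to the Go 1.8.3 mentioned above. PreferGo is a preference, not a guarantee, on every platform.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver (hypothetical helper) returns a resolver that asks the
// net package to use its built-in Go DNS client instead of the cgo one.
// Requires Go 1.9 or later for the PreferGo field.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a dead network does not block indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupIPAddr(ctx, "google.com")
	if err != nil {
		fmt.Printf("error: %s\n", err)
		return
	}
	fmt.Printf("success: %v\n", addrs)
}
```

This avoids the stale libc state rather than fixing it, so it only helps programs that can live without the cgo resolver.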

This seems to be related to a bug in glibc, where /etc/resolv.conf isn't checked for changes as often as it should be. It looks like glibc is going to ship a fix for this in their next release, though affected versions will still be around for years. Debian has carried its own patch for this for a long time (this patch?), which is why I don't think the bug will repro on Debian-based distros. I've recently had to work around this issue in application code by calling libc::res_init manually. The Rust standard library has started calling res_init after DNS failures by default, and that issue links to similar workarounds in other languages and applications.

Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).

I think the Go standard library should consider adding a workaround similar to what Rust has done. It looks like this was considered in #10850, but it might not have been clear how common this bug is? The most painful case is a long-running Go process on a laptop where the WiFi comes and goes.

[Apologies if the "proposal" label was incorrect. This could be called a bugfix, depending on how you feel about it :) ]

Labels: NeedsFix (the path to resolution is known, but the work has not been done), Proposal, Proposal-Accepted, early-in-cycle (a change that should be done early in the 3 month dev cycle), help wanted