Description
So, you found this issue by googling for "cannot unmarshal DNS".
There's good news: your issue has largely been fixed. I originally filed the issue below because I hit it on my own network and operating system, but further investigation showed that it affects every major OS, as well as users of VPNs, DNS providers written in Go, and more.
If you are a maintainer of code and someone has reported this issue: update your build system to use Go 1.16.15, Go 1.17.8, or Go 1.18. The error should then go away for your users.
If you are a user of a program and see this error, you need to ask the maintainer or creator of that program to do likewise. Unfortunately, there isn't a single set of instructions I can give for a workaround. If you're using a VPN, try running the program without the VPN; that is the most common user-reported scenario I've seen.
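For maintainers who want to confirm which toolchain a binary was actually built with, the version list above can be turned into a small check. This is a minimal sketch: `hasDNSFix` is a hypothetical helper, not part of any library, and it deliberately treats pre-release and development version strings as "not fixed" because the report does not say which of those include the change.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// hasDNSFix reports whether a version string such as "go1.17.6" names a
// release listed above: 1.16.15 or later in the 1.16 line, 1.17.8 or later
// in the 1.17 line, or 1.18 and newer. Hypothetical helper; any string it
// cannot parse (betas, release candidates, development builds) yields false.
func hasDNSFix(version string) bool {
	if !strings.HasPrefix(version, "go1.") {
		return false
	}
	parts := strings.SplitN(strings.TrimPrefix(version, "go1."), ".", 2)
	minor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	patch := 0
	if len(parts) == 2 {
		if patch, err = strconv.Atoi(parts[1]); err != nil {
			return false
		}
	}
	switch {
	case minor >= 18:
		return true
	case minor == 17:
		return patch >= 8
	case minor == 16:
		return patch >= 15
	default:
		return false
	}
}

func main() {
	v := runtime.Version()
	fmt.Printf("built with %s, includes DNS fix: %v\n", v, hasDNSFix(v))
}
```

Logging this once at startup is one way to make user reports easier to triage.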
Original bug report:
What version of Go are you using (go version)?
$ go version
go version go1.17.6 linux/amd64
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
Note: WSL2 on Windows. This is relevant, but it is not the only scenario in which the problem can occur; see below.
go env output:
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/friel/.cache/go-build"
GOENV="/home/friel/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/friel/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/friel/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/friel/.local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/friel/.local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/friel/go/src/github.com/pulumi/pulumi-yaml/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3112884807=/tmp/go-build -gno-record-gcc-switches"
What did you do?
Use infrastructure as code tools to manage Azure, and/or attempt to execute net.LookupIP("management.azure.com").
Example program:
package main

import (
	"fmt"
	"net"
)

func main() {
	ips, err := net.LookupIP("management.azure.com") // triggers the DNS query
	if err != nil {
		panic(err)
	}
	for _, ip := range ips {
		fmt.Println(ip) // one address per line
	}
}
What did you expect to see?
I expected to see the current IP, 13.86.219.80, as shown by the last line of:
$ host management.azure.com
management.azure.com is an alias for management.privatelink.azure.com.
management.privatelink.azure.com is an alias for arm-frontdoor-prod.trafficmanager.net.
arm-frontdoor-prod.trafficmanager.net is an alias for westus.management.azure.com.
westus.management.azure.com is an alias for arm-frontdoor-westus.trafficmanager.net.
arm-frontdoor-westus.trafficmanager.net is an alias for westus.cs.management.azure.com.
westus.cs.management.azure.com is an alias for rpfd-prod-by-01.cloudapp.net.
rpfd-prod-by-01.cloudapp.net has address 13.86.219.80
What did you see instead?
$ go run resolve-test.go
panic: lookup management.azure.com on 172.20.32.1:53: cannot unmarshal DNS message
goroutine 1 [running]:
main.main()
/home/friel/c/resolve-test/resolve-test.go:11 +0xe8
exit status 2
Miscellany
This issue appears to widely affect infrastructure as code tools such as Pulumi, Terraform, and others when they make API calls to Microsoft Azure from the Windows Subsystem for Linux 2 (WSL2) on Microsoft Windows.
This is a bit of a rock-and-a-hard-place situation. Microsoft is unlikely to update its DNS server to adhere to the pre-1999 DNS specification. The Go team is in a position to be much more agile and issue a point release that supports a larger buffer size; even going up to a single standard MTU of ~1500 bytes would resolve this issue in the near term.
As this problem primarily affects programs written in Go, in this author's estimation it seems unlikely that a change in Windows' DNS server behavior could occur as quickly, even if the stars were to align on the need to change the implementation. Note that host, dig, nslookup, etc. all behave correctly.
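To illustrate why the receive buffer size matters, the following sketch sends a 600-byte UDP datagram over loopback and reads it into buffers of 512 and 1500 bytes. It is not the resolver's code; `readDatagram` is a hypothetical helper, and the 600-byte payload is an arbitrary stand-in for an oversized DNS response.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// readDatagram sends one UDP datagram of payloadSize bytes over loopback and
// reads it into a buffer of bufSize bytes, returning how many bytes arrived
// in that buffer. Hypothetical helper for illustration only.
func readDatagram(payloadSize, bufSize int) (int, error) {
	pc, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer pc.Close()

	c, err := net.Dial("udp", pc.LocalAddr().String())
	if err != nil {
		return 0, err
	}
	defer c.Close()
	if _, err := c.Write(make([]byte, payloadSize)); err != nil {
		return 0, err
	}

	// Avoid blocking forever if the datagram is lost.
	if err := pc.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return 0, err
	}
	buf := make([]byte, bufSize)
	n, _, err := pc.ReadFrom(buf)
	return n, err
}

func main() {
	for _, size := range []int{512, 1500} {
		n, err := readDatagram(600, size)
		fmt.Printf("buffer=%d received=%d err=%v\n", size, n, err)
	}
}
```

On Linux, the read into the short buffer typically returns only the first 512 bytes without an error, so the tail of the message is silently lost; other platforms may report an error instead. Either way, a parser handed the short read cannot decode the full message.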
Collected notes and root cause analysis:
1. The DNS server used by WSL2 on Microsoft Windows sends additional metadata, so its responses can exceed the 512-byte UDP response size mandated by the DNS RFC (RFC 1035). See "DNS server mixes AUTHORITY/ADDITIONAL section into ANSWER section while responding to queries" (microsoft/WSL#5806) and "WSL Internet Connection Sharing DNS resolver does not adhere to 512 byte UDP limit" (microsoft/WSL#7642).
2. Go's DNS resolver in the net package applies a strict 512-byte limit to the buffer it will fill with a response. See "net: issue with DNS response > 512 bytes (cannot unmarshal DNS message)" (#21160) and "net: "cannot unmarshal DNS message" when using netdns=go under Windows/WSL2 running ubuntu 20.04" (#44135).
3. Microsoft Azure appears to have added a new CNAME to its management.azure.com endpoint, likely within the last week (?). Because of (1), this pushes the response size over the 512-byte limit, and because of (2), the result is a "cannot unmarshal DNS message" error. See microsoft/WSL#5806 (comment), #44135 (comment), "Error when using terraform inside WSL2" (microsoft/WSL#8022), and microsoft/WSL#7642 (comment).
DNS Flag Day 2020 had an explicit goal of ensuring that resolvers had a minimum accepted buffer size of 1232 bytes: https://dnsflagday.net/2020/#action-dns-resolver-operators
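As a rough worked example of the sizes involved, the sketch below estimates the uncompressed wire size of a response carrying the CNAME chain shown in the `host` output earlier, and compares it with the 512-byte and 1232-byte limits. The helper names are hypothetical. The estimate is an upper bound for the answer section only: name compression would make a real answer smaller, while the extra authority/additional records described in the first note above would make it larger.

```go
package main

import "fmt"

// wireNameLen is the uncompressed wire length of a domain name: one length
// byte per label plus the terminating zero byte, which equals len(name)+2
// for a non-empty name written without a trailing dot.
func wireNameLen(name string) int { return len(name) + 2 }

// uncompressedSize estimates the size of a response to an A query for
// chain[0], where each name is a CNAME for the next and the last name has a
// single A record. chain must be non-empty.
func uncompressedSize(chain []string) int {
	const header, questionFixed, rrFixed, ipv4 = 12, 4, 10, 4
	size := header + wireNameLen(chain[0]) + questionFixed
	for i := 0; i+1 < len(chain); i++ {
		// One CNAME record: owner name, fixed fields, target name.
		size += wireNameLen(chain[i]) + rrFixed + wireNameLen(chain[i+1])
	}
	// Final A record: owner name, fixed fields, 4-byte address.
	size += wireNameLen(chain[len(chain)-1]) + rrFixed + ipv4
	return size
}

func main() {
	chain := []string{
		"management.azure.com",
		"management.privatelink.azure.com",
		"arm-frontdoor-prod.trafficmanager.net",
		"westus.management.azure.com",
		"arm-frontdoor-westus.trafficmanager.net",
		"westus.cs.management.azure.com",
		"rpfd-prod-by-01.cloudapp.net",
	}
	size := uncompressedSize(chain)
	fmt.Printf("estimated uncompressed size: %d bytes (over 512: %v, over 1232: %v)\n",
		size, size > 512, size > 1232)
}
```

The point of the arithmetic is only that a seven-name chain leaves little headroom under 512 bytes, while fitting comfortably within 1232.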