The Story of a Rust Bug

A while ago, I built buzz, a system-tray application in Rust that notifies me of new e-mail. It was working fine, but every now and again it would fail to connect to the mail server on boot if it started before my network connection was up. That sounded easy enough to fix, so I just added a loop around my connect (even made it have exponential backoff!) and considered the issue dealt with.
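For readers who have not written one before, a retry loop with exponential backoff might look roughly like the sketch below. This is not buzz's actual code: the function name `retry_with_backoff` and all of its parameters are made up for illustration.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: calls `attempt` until it succeeds or `max_attempts`
/// is reached, doubling the delay after every failure (capped at `max_delay`).
/// At least one attempt is always made; the last error is returned on failure.
fn retry_with_backoff<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    initial_delay: Duration,
    max_delay: Duration,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(e) if tries >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                // Double the wait, but never exceed the cap.
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

A call such as `retry_with_backoff(|| TcpStream::connect("google.com:80"), Duration::from_secs(1), Duration::from_secs(60), 10)` would then wait 1s, 2s, 4s, and so on between attempts.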

Fast forward a few days, and the same issue happens again. I boot my computer, buzz starts before my network is ready, and while it now continues trying to connect for a while, each attempt fails with the same error: “Name or service not known”. Hmm… I kill buzz and restart it, and lo and behold it connects immediately without issue. What sorcery is this?

It’s time to do some debugging. First, let’s write a sample program:

fn main() {
    use std::thread;
    use std::time::Duration;
    use std::net::TcpStream;

    loop {
        match TcpStream::connect("google.com:80") {
            Ok(_) => {
                println!("connected");
                break;
            }
            Err(e) => {
                println!("failed: {:?}", e);
            }
        }
        thread::sleep(Duration::from_secs(1));
    }
}

Running it prints “connected” — great. Now, let’s disconnect the internet and run it again — it prints

failed: Name or service not known

over and over again. Okay, still as expected. Now let’s turn the internet back on:

failed: Name or service not known
failed: Name or service not known
failed: Name or service not known
failed: Name or service not known

Something is definitely fishy. So, let’s ask Google:

linux c connect before interface comes up and then retry

No particularly promising results there…

linux c connect before interface comes up and then retry Name or service not known

Still unhelpful.

linux internet connectivity retry connection failed

This takes us to a somewhat promising Thunderbird issue named “Thunderbird “Failed to connect to server” after connecting to Internet”, but it fails to reach any helpful conclusions. But we shall not surrender!

“Name or service not known” after interface comes up

Ooooh, “#2825 (Pidgin cannot reconnect after changing networks)” looks promising. Among the comments:

This is an issue when NetworkManager isn’t around. res_init (rereads /etc/resolv.conf) is only called in the NM path.

/etc/resolv.conf is where Linux keeps track of the DNS nameservers to use when looking up the IP addresses of domain names. Depending on your network configuration, that file is empty when you are offline, and entries are filled in when you connect to a network. The comment suggests that the contents of this file are cached, which would mean our program never learns of any nameservers (it always sees the empty list), and so all DNS resolution fails for all eternity!
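To make the "empty list" concrete, here is a small sketch that pulls the nameserver entries out of resolv.conf-style text. The helper `nameservers` is hypothetical and much simpler than what a real resolver does (it ignores `search`, `options`, and so on); the addresses in the usage note are documentation-only examples.

```rust
/// Hypothetical helper: returns the address on every `nameserver` line of
/// resolv.conf-style text, skipping blank lines and `#` / `;` comments.
fn nameservers(conf: &str) -> Vec<String> {
    conf.lines()
        .map(str::trim)
        .filter(|line| !line.starts_with('#') && !line.starts_with(';'))
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                // Only lines of the form "nameserver <address>" count.
                (Some("nameserver"), Some(addr)) => Some(addr.to_string()),
                _ => None,
            }
        })
        .collect()
}
```

Fed the text of an offline machine's (empty) file, this returns an empty list; fed something like `nameserver 192.0.2.53`, it returns that one address. A process that read the file once while it was empty, and never re-read it, is stuck with the first answer.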

Armed with the knowledge about res_init, our Google searches are suddenly a lot more helpful, and reveal that this problem is actually something many projects have encountered. Let’s see what Rust does, why our program doesn’t work, and how we might fix it.

TcpStream::connect takes an argument that implements ToSocketAddrs. “Implementors” at the bottom of that page shows that ToSocketAddrs is implemented for str, which makes a lot of sense given that we’re passing in a str to it in our code above. Let’s click the src link at the top right of the page to see what it’s doing to turn that string into an IP address!
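As a quick illustration of what the ToSocketAddrs implementation for str does, the sketch below wraps `to_socket_addrs` in a small helper (the name `resolve` is made up for this post). A string holding a literal IP address and port is parsed directly, while a host name like the one in our sample program has to go through the system resolver.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: turns anything implementing `ToSocketAddrs`
/// into the concrete socket addresses it resolves to.
fn resolve<A: ToSocketAddrs>(addr: A) -> io::Result<Vec<SocketAddr>> {
    // `to_socket_addrs` returns an iterator of addresses, or an error
    // if the input is malformed or the lookup fails.
    Ok(addr.to_socket_addrs()?.collect())
}
```

Calling `resolve("127.0.0.1:80")` needs no DNS at all, whereas `resolve("google.com:80")` is the case that ends up in the lookup code we are about to read.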

Scrolling down a little, we see that it calls resolve_socket_addr, which is defined a bit further up. That in turn calls lookup_host, which seems to just be a stub of some sort:

use sys_common::net as net_imp;
// ...
pub fn lookup_host(host: &str) -> io::Result<LookupHost> {
    net_imp::lookup_host(host).map(LookupHost)
}

That’s not very helpful. It’s time we go explore the Rust GitHub repository. There’s a lot of stuff here, but let’s just try to take the fast path to what we’re looking for. The use sys_common::net gives us a starting point: use statements like these inside the standard library refer to other modules of the standard library, which lives in src/libstd. And how about that, right there is a little sys_common subdirectory. Let’s open that up. And then we want net.rs, because that’s the module the lookup_host function was using.

(At this point, it’s worth noting that I’ve cheated a little. Since my fix for this bug has now landed, if you actually follow the path on GitHub, you’ll see the new version of net.rs. If you want to see what I saw, click the linked net.rs above instead.)

So, lookup_host calls c::getaddrinfo, and then returns Ok. There’s a call to cvt_gai in there suffixed with a ?, which I assume deals with the case where the lookup fails, but let’s ignore that for a second. There’s no call to res_init here. This means that unless the application calls res_init itself, a program that starts while /etc/resolv.conf is still empty will never manage to resolve a host name, and so never gets to use the internet. That’s pretty sad. Let’s fix that!

The most straightforward fix is to just call res_init directly from our application if connect fails. But, in order to do that, we, well, need to be able to call res_init. res_init is a function in libc (at least on UNIX-like systems), so the place to look would be the Rust libc crate. If you look today, there is a res_init function in libc, but this was not the case when I looked. So, time to file a pull request!

The libc README clearly states the process for adding an API, and it basically comes down to “add a function to src/unix/mod.rs, submit, and then fix failing tests”. It turns out that res_init is actually somewhat funky as far as libc functions go, so it took quite a bit of digging to get it right on all the UNIX-y platforms that libc supports. But with the aid of the amazing Alex Crichton, green checkmarks eventually started appearing, and PR#585 landed (if you want to know what the process is like, I encourage you to read through the comments there).

Okay, so we now have libc::res_init, which means we can fix our application by adding a dependency on libc and calling the function manually after each failed connection attempt. While this would work, it doesn’t feel particularly elegant. And what about other people who will inevitably also run into the same issue? No, we can do better. Time to fix Rust!
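For reference, the application-level workaround could look something like the sketch below. To keep it free of external dependencies, the function takes a hook that runs after every failed attempt; `connect_with_hook` and its parameters are invented for this example. In a real program with the libc crate as a dependency (an assumption here), the hook would be where libc::res_init gets called.

```rust
use std::io;
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: tries to connect up to `max_attempts` times and
/// runs `on_failure` after every failed attempt, before sleeping.
fn connect_with_hook(
    addr: &str,
    max_attempts: u32,
    delay: Duration,
    mut on_failure: impl FnMut(&io::Error),
) -> io::Result<TcpStream> {
    let mut tries = 0;
    loop {
        tries += 1;
        match TcpStream::connect(addr) {
            Ok(stream) => return Ok(stream),
            Err(e) => {
                // With the libc crate, this hook could be
                // `|_| unsafe { libc::res_init(); }` so that the resolver
                // re-reads /etc/resolv.conf before the next attempt.
                on_failure(&e);
                if tries >= max_attempts {
                    return Err(e);
                }
                thread::sleep(delay);
            }
        }
    }
}
```

It works, but every networked application would need to carry its own copy of this, which is exactly the inelegance complained about above.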

First, I filed #41570, an issue outlining the problem, giving much of the same reasoning and examples that I’ve given in this post. I actually did that before my libc PR, but that’s sort of beside the point. I then asked for opinions about what the best place to implement the fix would be, suggesting the lookup_host function we found above. Alex Crichton responded (again!), and PR#41582 was born. I’ll spare you some of the details (read the comments if you want them), but two primary changes were needed:

It took a few iterations to get the kinks ironed out and all OS targets to be happy, but on May 5th at 5:35pm, the Rust build system accepted and merged my PR! A few hours later, at midnight UTC, a new nightly release of Rust was published, which included my fix. After a quick rustup update and a recompile, buzz now works correctly without any changes to the code! Yay progress!

Hopefully this post has given some insight into what is involved in making a contribution to the Rust standard library, and may give you some pointers to what you might do if you find something you would like to fix in Rust! It doesn’t even have to involve coding — the Rust team would love documentation changes too. Happy hacking!