zerforschen.plus/content/posts/tiny-binaries-rust.md
2025-04-09 23:58:25 +02:00

+++
date = '2025-04-09T20:29:48+02:00'
draft = false
title = 'Optimizing Rust binary size'
tags = ['rust', 'servicepoint']
+++

At CCC Berlin, there is a large pixel matrix hanging on the wall that we call the "ServicePoint display". It receives commands from the local network via UDP, which cover things like very basic text rendering and overwriting parts of the pixel buffer. The commands are sent in an efficient binary data structure. I wrote (most of) the Rust library servicepoint, which implements this protocol including serialisation, deserialisation and a bunch of extras for easily creating these packets. There are also bindings for other languages, including C.

A few weeks ago, the only user of those C bindings I know informed me, with a big grin on their face, that they'd stop using the library and instead wanted to write everything by hand. While I know from experience that writing such a library is great fun (and thus does not need another reason), I was intrigued and wanted to know why. The main reason they cited was binary size, and while there's probably something wrong with your computer if you do not have 1 MB to spare, I agreed that it was too big for what it does and that I would investigate. They knew what they were doing, and it worked: I was immediately nerd-sniped and could not think about anything else in my spare time for a whole week. I had to find out why it was so big, and there had to be a way to fix it.

This is part one, where I optimize the core library for size for fun and experience. The order in which I tried all the options is changed for a better text structure, but the results are re-created in the order they appear using the stated tools. In a future post, I also want to document how I got the C bindings smaller, as those use all features by default and cannot be reasoned about as much by the Rust compiler.

Most of the techniques I used are described in Minimizing Rust Binary Size, though I hope the specific example I provide makes the topic interesting to readers not writing Rust code.

Let's get hacking!

Starting point

The commit I started on was fe67160974d9fed542eb37e5e9a202eaf6fe00dc, which is not part of main as of the writing of this post.

As I needed some binary to compare, I chose the example announce:

//! An example for how to send text to the display.

// [1]
use clap::Parser;
use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
    TILE_WIDTH,
};

// [2]
#[derive(Parser, Debug)]
struct Cli {
    #[arg(short, long, default_value = "localhost:2342",
        help = "Address of the display")]
    destination: String,
    #[arg(short, long, num_args = 1.., value_delimiter = '\n',
        help = "Text to send - specify multiple times for multiple lines")]
    text: Vec<String>,
    #[arg(short, long, default_value_t = true,
        help = "Clear screen before sending text")]
    clear: bool,
}

/// example: `cargo run -- --text "Hallo" --text "CCCB"`
fn main() {
    // [3]
    let mut cli = Cli::parse();
    if cli.text.is_empty() {
        cli.text.push("Hello, CCCB!".to_string());
    }

    // [4]
    let connection = UdpConnection::open(&cli.destination)
        .expect("could not connect to display");

    // [5]
    if cli.clear {
        connection.send(ClearCommand).expect("sending clear failed");
    }

    let text = cli.text.join("\n"); // [6]
    let command: CharGridCommand = CharGrid::wrap_str(TILE_WIDTH, &text).into(); // [7]
    connection.send(command).expect("sending text failed"); // [8]
}

Let me quickly run you through the program.

  1. Some imports of the used libraries.
  2. The structure Cli is defined to hold the command line arguments. clap is used to automatically derive a Parser from the attributes on the fields.
  3. The command line arguments are parsed and a default value for the text to send is set.
  4. A UDP connection is opened (see footnote 1).
  5. Depending on the arguments, the screen is cleared.
  6. All text snippets provided as an argument are concatenated with newlines in between. --text "Hallo" --text "CCCB" turns into Hallo\nCCCB.
  7. The string is wrapped to the width of the display, resulting in a CharGrid, which is then immediately turned into a CharGridCommand. No fields are changed after this, so the text will be rendered in the top left of the screen when executed on the display.
  8. The command is sent to the display.

At some steps, the program panics with an error message in case something went wrong.
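Steps 6 and 7 can be sketched with the standard library alone. This is only an illustration under assumptions: `wrap_lines` is a hypothetical stand-in, not the servicepoint implementation, and the width of 8 is an arbitrary value instead of the real `TILE_WIDTH`.

```rust
/// Hypothetical stand-in for the wrapping step (not servicepoint's code):
/// splits on newlines, then hard-wraps each line after `width` characters.
fn wrap_lines(text: &str, width: usize) -> Vec<String> {
    assert!(width > 0, "width must be at least 1");
    let mut rows = Vec::new();
    for line in text.split('\n') {
        let chars: Vec<char> = line.chars().collect();
        if chars.is_empty() {
            // keep empty lines as empty rows
            rows.push(String::new());
        }
        for chunk in chars.chunks(width) {
            rows.push(chunk.iter().collect());
        }
    }
    rows
}

fn main() {
    // Step 6: `--text "Hallo" --text "CCCB"` becomes one string.
    let args = vec!["Hallo".to_string(), "CCCB".to_string()];
    let text = args.join("\n");

    // Step 7 (sketch): one row per line, long lines are broken up.
    for row in wrap_lines(&text, 8) {
        println!("{row}");
    }
}
```

The real `CharGrid::wrap_str` returns a grid type rather than a list of strings, so treat this purely as a mental model of the data flow.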

I started with rustc 1.82.0 (f6e511eec 2024-10-15) from nixpkgs 0ff09db9d034a04acd4e8908820ba0b410d7a33a. For compiling the example, I just used the usual cargo build --release --example announce and checked the binary size with ll -B target/release/examples.

The resulting size was 1.1 MB, which should be easy enough to beat.

Low-hanging fruit

Compiler options

The first thing that came to mind was telling the compiler to optimize for size, like with gcc -Os. The Rust equivalent is opt-level = "s". For even more aggressive size optimization there is opt-level = "z", which additionally disables loop vectorization.

| Option | Size in isolation (bytes) | Cumulative size (bytes) |
|---|---|---|
| baseline | 1.137.384 | 1.137.384 |
| opt-level = 'z' | 1.186.104 | 1.186.104 |
| opt-level = 's' | 1.120.416 | 1.120.416 |
| lto = true | 914.496 | 808.528 |
| codegen-units = 1 | 982.904 | 775.888 |
| panic = 'abort' | 979.840 | 703.096 |
| strip = true | 915.944 | 580.056 |
| switching back to opt-level = 'z' | | 555.480 |

So it turns out that if you want to halve your binary size, a few flags in stable Rust are enough. The most significant impacts came from link time optimization (LTO) and stripping of symbols from the binary. Interestingly, different combinations of these settings didn't scale the way I would have intuitively thought.
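To put the table into relative terms, here is a small sketch that computes the reduction against the baseline. The byte counts are copied from the table; the helper `reduction_percent` is just a name I made up for this illustration.

```rust
/// Size reduction relative to `before`, in percent.
fn reduction_percent(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    const BASELINE: u64 = 1_137_384;

    // (label, size in bytes), taken from the table
    let steps = [
        ("lto = true, in isolation", 914_496),
        ("strip = true, in isolation", 915_944),
        ("all options, opt-level = 's'", 580_056),
        ("all options, opt-level = 'z'", 555_480),
    ];

    for (label, size) in steps {
        println!("{label}: -{:.1} %", reduction_percent(BASELINE, size));
    }
}
```

With all options and opt-level = 'z', the reduction comes out at roughly 51 %, which is where the "halving" comes from.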

The only compromise apart from compilation time with these settings is the change in panic behavior, as this means no stack traces on crash (see footnote 2).
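The behavior change is easiest to see with code that relies on unwinding. The following is a minimal sketch, not something taken from servicepoint: `panicked` is a hypothetical helper that uses `std::panic::catch_unwind`. With the default `panic = "unwind"` it returns `true` for a panicking closure; with `panic = 'abort'` the process would terminate at the panic and never reach the `println!`.

```rust
use std::panic;

/// Runs `f` and reports whether it panicked.
/// Only meaningful with the default `panic = "unwind"`.
fn panicked<F: FnOnce() + panic::UnwindSafe>(f: F) -> bool {
    panic::catch_unwind(f).is_err()
}

fn main() {
    let failed = panicked(|| {
        panic!("boom");
    });
    // Not reached when compiled with panic = 'abort'.
    println!("closure panicked: {failed}");
}
```

If nothing in your program or its dependencies catches panics like this, switching to abort mostly changes what a crash looks like.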

To only compile like this in specific scenarios, you can add a new profile to a crate's Cargo.toml like this:

[profile.size-optimized]
inherits = "release"
opt-level = 's'     # Optimize for size
lto = true          # Enable link-time optimization
codegen-units = 1   # Reduce number of codegen units to increase optimizations
panic = 'abort'     # Abort on panic
strip = true        # Strip symbols from binary

The profile can be used by passing --profile=size-optimized instead of --release to cargo build. Because of the different profile, the binary ends up in a different folder (ll -B target/size-optimized/examples to check size).

Features

Rust has a very handy way to represent variability in a library called features. The servicepoint library has the following declaration in its Cargo.toml:

[features]
default = ["compression_lzma", "protocol_udp", "cp437"]
compression_zlib = ["dep:flate2"]
compression_bzip2 = ["dep:bzip2"]
compression_lzma = ["dep:rust-lzma"]
compression_zstd = ["dep:zstd"]
all_compressions = ["compression_zlib", "compression_bzip2", "compression_lzma", "compression_zstd"]
rand = ["dep:rand"]
protocol_udp = []
protocol_websocket = ["dep:tungstenite"]
cp437 = ["dep:once_cell"]

The second line means that by default, cargo enables LZMA compression, sending via UDP sockets and conversion between CP-437 and UTF-8. Apart from protocol_udp, each of those features pulls in an optional dependency (which is why I made them toggleable in the first place). In the example code, CP-437 and compression are not needed (see footnote 3), but UDP is obviously used.
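Inside a library, a feature typically shows up as a compile-time condition. Here is a minimal sketch of that idea, mirroring the "LZMA or no compression" default described in footnote 3. The enum and function are hypothetical and not the actual servicepoint code; only the feature name is taken from the listing.

```rust
#[derive(Debug, PartialEq)]
enum Compression {
    Uncompressed,
    Lzma,
}

/// Picks a default at compile time, depending on the enabled features.
// The allow is meant to avoid a warning about the unknown feature name
// when this file is compiled on its own, without a Cargo.toml.
#[allow(unexpected_cfgs)]
fn default_compression() -> Compression {
    if cfg!(feature = "compression_lzma") {
        Compression::Lzma
    } else {
        Compression::Uncompressed
    }
}

fn main() {
    println!("default: {:?}", default_compression());
}
```

Compiled without the feature, the condition is a constant `false`, so the optimizer can drop the LZMA branch and everything only reachable through it.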

Features can be toggled on the command line, which means the invocation can be changed to the following: cargo build --example announce --profile=size-optimized --no-default-features --features=protocol_udp (see footnote 4). Doing that means less library code and fewer dependencies are pulled into the compilation process.

The result is a 555.480 Byte binary, which is exactly the same as without those flags. This is not really surprising, as we enabled a bunch of compiler options that help remove whole sections of code that are not needed, especially link time optimization. It is cool to see that the binary is identical, though.

In the rest of this post, I will omit those parameters, probably to the detriment of compilation time.

Digging deeper

While this was a big improvement already, this was still 50 times the size of the C program.

If halving it was this easy, could I do it a second time?

Everything from here on required unstable features of the Rust toolchain, both because tooling depends on them for more information about the program, and because the compiler options from here on are not (and maybe never will be) stabilized.

The version I ended up with was rustc 1.88.0-nightly (5e17a2a91 2025-04-05). In my environment, I had to call nightly cargo with rustup run nightly cargo, but that part is not included in the rest of the commands. The executables I got with the unstable version were already a bit smaller again (546.528 bytes).

The first thing I noticed was that I got some new warnings when compiling, all of which I fixed immediately. As it was mostly inside of the documentation, I did not expect this to affect file size.

Next up, I added cargo-bloat to my flake. This tool can show you which functions take up most of the space in your binary. The invocation is similar to building - cargo bloat --example announce --profile=size-optimized resulted in the following output:

File  .text     Size        Crate Name
 1.0%   5.5%  21.0KiB clap_builder clap_builder::parser::parser::Parser::get_matches_with
 0.9%   5.3%  20.5KiB          std std::backtrace_rs::symbolize::gimli::Cache::with_global
 0.6%   3.3%  12.6KiB          std std::backtrace_rs::symbolize::gimli::Context::new
 0.4%   2.4%   9.2KiB          std gimli::read::dwarf::Unit<R>::new
 0.4%   2.1%   7.9KiB          std addr2line::line::LazyLines::borrow
 0.3%   2.0%   7.5KiB     announce announce::main
 0.3%   1.8%   7.1KiB          std miniz_oxide::inflate::core::decompress
 0.3%   1.6%   6.3KiB          std addr2line::unit::ResUnit<R>::find_function_or_location::{{closure}}
 0.3%   1.5%   5.6KiB clap_builder clap_builder::builder::command::Command::_build_self
 0.2%   1.4%   5.3KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_templated_help
 0.2%   1.3%   5.1KiB clap_builder clap_builder::error::Error<F>::print
 0.2%   1.3%   4.9KiB clap_builder clap_builder::parser::parser::Parser::react
 0.2%   1.2%   4.8KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_args
 0.2%   1.2%   4.6KiB          std gimli::read::unit::parse_attribute
 0.2%   1.1%   4.4KiB          std addr2line::function::Function<R>::parse_children
 0.2%   1.0%   3.7KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_subcommands
 0.2%   1.0%   3.7KiB clap_builder clap_builder::output::usage::Usage::write_arg_usage
 0.2%   1.0%   3.7KiB          std gimli::read::rnglists::RngListIter<R>::next
 0.1%   0.8%   3.1KiB          std std::backtrace_rs::symbolize::gimli::elf::<impl std::backtrace_rs::symbolize::gimli::Mapping>::new_debug
 0.1%   0.8%   3.0KiB clap_builder clap_builder::parser::parser::Parser::match_arg_error
10.8%  61.8% 237.3KiB              And 993 smaller methods. Use -n N to show more.
17.5% 100.0% 384.2KiB              .text section size, the file size is 2.1MiB

Starting with the largest, the biggest functions in the program are shown. From the table, we can already see some interesting stuff.

  1. For some reason, the .text section (the machine code) is only a small part of the executable, and the total size increased by a factor of 4.
  2. The biggest function and a bunch of other big ones are from clap_builder, a crate that is part of the command line argument parser.
  3. std takes up most of the rest.
  4. main is unexpectedly huge?
  5. servicepoint does not even show up in the top list.

Let's cover those points in order.

1. Unexpected binary size when building via cargo-bloat

Using GNU size, we can check the size per section in the ELF binary. Using -G or -B output formats does not work for this, as it will only show the .text and .data section, which in this case only make up around 500KB. Thus the command I used was size -A --common target/size-optimized/examples/announce, giving the following result:

section                size     addr
.dynsym                1680      856
.dynstr                1198     3500
.rela.dyn             22800     4704
.gcc_except_table      3728    27552
.rodata               36592    31280
.eh_frame_hdr          8116    67872
.eh_frame             52488    75992
.text                393449   132576
.data.rel.ro          18760   530248
.relro_padding         2888   550072
.data                  2400   554168
.debug_abbrev          1810        0
.debug_info          525404        0
.debug_aranges         6256        0
.debug_ranges        157856        0
.debug_str           726991        0
.debug_line          149936        0
Total               2115147

(I filtered out the rows <1KB for brevity)

Turns out cargo-bloat disables symbol stripping, because it needs the symbols to show function names to the user. And it is not just the symbols that release builds include by default: all the debugging information is kept as well. That means I can ignore that problem and focus on the .text size.
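As a quick sanity check, the debug sections can be summed up and subtracted from the total. This sketch only uses the numbers from the `size -A` listing; the function name `debug_bytes` is made up for the illustration.

```rust
/// (section, size in bytes) for the debug sections of the `size -A` listing.
const DEBUG_SECTIONS: [(&str, u64); 6] = [
    (".debug_abbrev", 1_810),
    (".debug_info", 525_404),
    (".debug_aranges", 6_256),
    (".debug_ranges", 157_856),
    (".debug_str", 726_991),
    (".debug_line", 149_936),
];

/// Total reported by `size -A` for the whole file.
const TOTAL: u64 = 2_115_147;

/// Sums the sizes of all sections whose name starts with `.debug`.
fn debug_bytes(sections: &[(&str, u64)]) -> u64 {
    sections
        .iter()
        .filter(|(name, _)| name.starts_with(".debug"))
        .map(|(_, size)| size)
        .sum()
}

fn main() {
    let debug = debug_bytes(&DEBUG_SECTIONS);
    println!("debug sections: {debug} bytes");
    println!("everything else: {} bytes", TOTAL - debug);
}
```

The debug sections account for 1.568.253 bytes, which leaves 546.894 bytes for everything else. That is within a few hundred bytes of the 546.528 byte stripped nightly build, so the extra size really is just debug data.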

2. Removing clap

While clap is super handy, it looks like the code needed to parse two simple arguments blows up the executable. As the C program I was comparing against had all the parameters hard-coded, I just ripped out the dependency and hard-coded the values I needed.

The result is the first version of tiny_announce, as I did not want to change the existing example.

//! An example for how to send text to the display.

use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
    TILE_WIDTH,
};

/// example: `cargo run --example tiny_announce`
fn main() {
    let text = "Hello, CCCB!";

    let connection = UdpConnection::open("127.0.0.1:2342")
        .expect("could not connect to display");

    connection.send(ClearCommand).expect("sending clear failed");

    let command: CharGridCommand = CharGrid::wrap_str(TILE_WIDTH, &text).into();
    connection.send(command).expect("sending text failed");
}

The command to compile changed slightly because of the new name. cargo build --example tiny_announce --profile=size-optimized && ll -B target/size-optimized/examples/tiny_announce gave me the new binary size. Drumroll... 324.624 Bytes! With argument parsing removed, we saved 40% of the remaining binary size. This also makes main disappear from the list of biggest functions for now.

While removing a library you do not really need is also available in stable Rust, I was only able to notice that with tooling only available on nightly, so I am putting it into that category.

3. build-std

Looking at the biggest functions again (now cargo bloat --example tiny_announce --profile=size-optimized) showed that all the big functions left were from std. Most of that looked like stack unwinding and debug data parsing, which is odd as we added panic = 'abort' in the first chapter.

As it turns out, as an optimization for the development workflow, cargo does not recompile the standard library by default. Instead, a prebuilt version included in the toolchain is used. The compiler arguments for that are fixed. To change that and get more control over how the standard library is compiled and linked, we need the unstable option -Zbuild-std and have to list which sub-crates we want to build (which is pretty much all of them). Because we also have panic = "abort" set, we also need to pass in -Zbuild-std-features="panic_immediate_abort" so there is no compilation error.

cargo build --example tiny_announce --profile=size-optimized -Zbuild-std="core,std,alloc,proc_macro,panic_abort" -Zbuild-std-features="panic_immediate_abort"

This produces a binary that is now only 30.992 bytes!

In-between find: to_socket_addrs

The remaining top 3 functions were:

 File  .text    Size         Crate Name
 4.4%  11.0%  2.0KiB           std <&T as std::net::socket_addr::ToSocketAddrs>::to_socket_addrs
 3.8%   9.4%  1.7KiB tiny_announce tiny_announce::main
 2.9%   7.3%  1.4KiB     [Unknown] main

Main shows up again! But what is that? 4.4% used by to_socket_addrs? We found the last string parsing code, this time in the standard library, which reads the IP and port from a string. After changing it in the example, it still showed up, which brings me to the first and only time I actually changed the servicepoint library as a result of this saga.

- let socket = UdpSocket::bind("0.0.0.0:0")?;
+ let addr = SocketAddr::from(([0, 0, 0, 0], 0));
+ let socket = UdpSocket::bind(addr)?;

This seemed to remove other functions as well, as the size was down to 17.272 bytes, nearly halving the size again. It is now smaller than this article as plain text markdown.
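The difference between the two ways of getting an address can be sketched with the standard library alone. `any_local_addr` is a hypothetical helper for this illustration; the point is that building a `SocketAddr` from numbers yields the same value as parsing the string, without needing the parser.

```rust
use std::net::{Ipv4Addr, SocketAddr};

/// Builds the wildcard address from numbers; no string parsing involved.
fn any_local_addr() -> SocketAddr {
    SocketAddr::from((Ipv4Addr::UNSPECIFIED, 0))
}

fn main() {
    // Same value as the typed version, but this line pulls the
    // address parser into the binary.
    let parsed: SocketAddr = "0.0.0.0:0".parse().expect("valid address");
    assert_eq!(parsed, any_local_addr());

    println!("{}", any_local_addr());
}
```

Passing a `&str` to `UdpSocket::bind` goes through `ToSocketAddrs`, which has to handle both literal addresses and host names, so it drags in considerably more code than the typed variant.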

4. no_main

You'd think that now main is the top function, but Iter::next is now the biggest function for some reason. Still, [Unknown] main and the actual main take up 10% of the remaining size according to cargo bloat.

We surely cannot reduce that, right? Wrong! With #[no_main], you can tell Rust to not add any initialization code. This means the normal fn main() does not get used, and the linker complains about the missing function. To fix this, the function can be converted to a C-style main.

I also removed some more code by initializing the CharGrid directly instead of wrapping a string, which saved 400 bytes.

#![no_main]

use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
};
use std::net::SocketAddr;

#[unsafe(no_mangle)]
pub extern "C" fn main(_argc: isize, _argv: *const *const u8) -> isize {
    // not parsing the address from str removes 3KB
    let addr = SocketAddr::from(([172, 23, 42, 29], 80));

    let connection = UdpConnection::open(addr).unwrap();
    connection.send(ClearCommand).unwrap(); // <--
    
    let grid = CharGrid::from_vec(5, vec!['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']).unwrap();
    connection.send(CharGridCommand::from(grid)).unwrap();
    0
}

This resulted in an 8.064 byte executable, finally beating both GCC and LLVM compiling the minimal C program (around 10KB). If we were to remove the marked line and not clear the screen, we could drop it further to 7.696 bytes.

Advanced compiler abuse

There are two things left to reach the absolute bottom without ripping out the standard library altogether.

In Rust, a function can ask the compiler to pass it the location of its caller as a hidden parameter (this is what the #[track_caller] attribute does, for example to produce better panic messages). With -Zlocation-detail=none, we instruct the Rust compiler to just not bother with that.
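A minimal sketch of that mechanism, using only the standard library. `call_site` is a hypothetical function for this illustration; `#[track_caller]` and `std::panic::Location` are the real building blocks.

```rust
use std::panic::Location;

/// Returns the source location of whoever called this function.
#[track_caller]
fn call_site() -> &'static Location<'static> {
    Location::caller()
}

fn main() {
    let here = call_site();
    // In a normal build this prints the file and line of the call above.
    println!("called from {}:{}", here.file(), here.line());
}
```

Every such call site needs its file name and position stored somewhere in the binary. As far as I understand the flag, -Zlocation-detail=none replaces them with placeholder values, so those strings no longer take up space.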

-Zfmt-debug=none is similar but worse, because it changes all the default Debug implementations to do nothing at all. The change in behavior is not obvious in this example, but do this in an application that has logging and it will be horribly broken.
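To make it clear what gets lost, here is a small sketch of typical debug logging. `Packet` and `describe` are made up for this illustration and are not servicepoint types.

```rust
#[derive(Debug)]
struct Packet {
    command: u16,
    payload_len: usize,
}

/// Formats the packet with the derived `Debug` implementation.
fn describe(packet: &Packet) -> String {
    format!("{packet:?}")
}

fn main() {
    let packet = Packet { command: 3, payload_len: 0 };

    // Relies on `Debug`: in a normal build this prints all fields.
    println!("{}", describe(&packet));

    // Relies on `Display` of the fields only, so it should not be
    // affected by the flag.
    println!("command {} with {} payload bytes", packet.command, packet.payload_len);
}
```

In a normal build, `describe` returns `Packet { command: 3, payload_len: 0 }`. With -Zfmt-debug=none, my understanding is that this string ends up without any of the useful content, which is exactly the "horribly broken" logging mentioned above.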

As the icing on the cake, those two options cannot be passed via cargo arguments, so we have to use the environment variable RUSTFLAGS to hand them through to rustc.

The final command to build the tiniest possible announce in all its glory:

RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none" \
cargo build \
    --example tiny_announce \
    --profile=size-optimized \
    --no-default-features \
    --features=protocol_udp \
    -Zbuild-std="core,std,alloc,proc_macro,panic_abort" \
    -Zbuild-std-features="panic_immediate_abort"

All of this reduces the final binary size to 8.064 bytes.

Conclusion

Through this journey, I've managed to reduce the size of the binary from an unwieldy 1.1 MB to an impressive 8 KB, without sacrificing all of the standard library. The most unexpected part for me was the size of the clap code, though I learned dozens of things at every step about the intricacies of how cargo build produces your final binary.

There is no single option that in itself is the solution. It's a matter of experimenting with a combination of compiler flags, feature toggles, and code optimizations. While extreme options can be great if you want to squeeze out the last bytes, it's probably not worth using those in a "normal" computer scenario.

The key takeaway is that optimizing binary size in Rust, while not always straightforward, is achievable with the right techniques. It is certainly easier to create a big binary in Rust than in C, but calling Rust bloated is blaming the language a bit too much.

Stay tuned for part two, in which I will try to do something similar with a C version of the example, using the C bindings of the servicepoint crate.


  1. Yes, I know UDP does not have connections. Internally, this just opens a UDP socket ↩︎

  2. Technically, you can catch a panic while unwinding and there may even be a weird performance argument for doing that, see ↩︎

  3. Some commands can be compressed, but the text ones (both CP-437 and UTF-8) cannot. Clear is a very simple command that does not have any payload, so no compression there either. If a BitmapCommand was used instead, using into() on a Bitmap would have hidden the fact that the default compression is used in that case. The default compression in turn is either LZMA or no compression, depending on whether the LZMA feature is enabled. ↩︎

  4. This works here because announce is an example inside of the library itself. As an actual dependent, you would specify this in your Cargo.toml. ↩︎