+++
date = '2025-04-09T20:29:48+02:00'
draft = false
title = 'Optimizing Rust binary size'
tags = ['rust', 'servicepoint']
+++

In [CCC Berlin](https://berlin.ccc.de/), there is a large pixel matrix hanging on the wall that we call the "ServicePoint display".
It receives commands from the local network via UDP, which cover things like very basic text rendering and overwriting parts of the pixel buffer.
The commands are sent in an efficient binary format.
I wrote (most of) the Rust library [servicepoint](https://crates.io/crates/servicepoint), which implements this protocol including serialisation, deserialisation and a bunch of extras for easily creating these packets.
There are also bindings for other languages, [including C](https://git.berlin.ccc.de/servicepoint/servicepoint-binding-c).

A few weeks ago, the only user of those C bindings I know of informed me, with a big grin on their face, that they'd stop using the library and instead write everything by hand.
While I know from experience that writing such a library is great fun (and thus does not need another reason), I was intrigued and wanted to know why.
The main reason they cited was binary size, and while there's probably something wrong with your computer if you do not have 1 MB to spare, I agreed that it was too big for what it does and that I would investigate.
They knew what they were doing, and it worked: I was immediately nerd-sniped and could not think about anything else in my spare time for a whole week.
I _had_ to find out why it was so big, and there would _have_ to be a way to fix it.

This is part one, where I optimize the core library for size for fun and experience.
I reordered the options I tried to give the text a better structure, but the results were re-created in the order they appear, using the stated tools.
In a future post, I also want to document how I got the C bindings smaller, as those use all features by default and cannot be reasoned about as much by the Rust compiler.

Most of the techniques I used are described in [Minimizing Rust Binary Size](https://github.com/johnthagen/min-sized-rust), though I hope the specific example I provide makes the topic interesting to readers not writing Rust code.

Let's get hacking!

## Starting point

The commit I started on was [fe67160974d9fed542eb37e5e9a202eaf6fe00dc](https://git.berlin.ccc.de/servicepoint/servicepoint/src/tag/tiny-rust-binaries-before), which is not part of `main` at the time of writing.

As I needed some binary to compare, I chose the example [announce](https://git.berlin.ccc.de/servicepoint/servicepoint/src/tag/tiny-rust-binaries-before/examples/announce.rs):

```rust
//! An example for how to send text to the display.

// [1]
use clap::Parser;
use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
    TILE_WIDTH,
};

// [2]
#[derive(Parser, Debug)]
struct Cli {
    #[arg(short, long, default_value = "localhost:2342",
          help = "Address of the display")]
    destination: String,
    #[arg(short, long, num_args = 1.., value_delimiter = '\n',
          help = "Text to send - specify multiple times for multiple lines")]
    text: Vec<String>,
    #[arg(short, long, default_value_t = true,
          help = "Clear screen before sending text")]
    clear: bool,
}

/// example: `cargo run -- --text "Hallo" --text "CCCB"`
fn main() {
    // [3]
    let mut cli = Cli::parse();
    if cli.text.is_empty() {
        cli.text.push("Hello, CCCB!".to_string());
    }

    // [4]
    let connection = UdpConnection::open(&cli.destination)
        .expect("could not connect to display");

    // [5]
    if cli.clear {
        connection.send(ClearCommand).expect("sending clear failed");
    }

    let text = cli.text.join("\n"); // [6]
    let command: CharGridCommand = CharGrid::wrap_str(TILE_WIDTH, &text).into(); // [7]
    connection.send(command).expect("sending text failed"); // [8]
}
```

Let me quickly walk you through the program.

1. Some imports of the used libraries.
2. The structure `Cli` is defined to hold the command line arguments. [clap](https://crates.io/crates/clap) is used to automatically derive a `Parser` from the attributes on the fields.
3. The command line arguments are parsed and a default value for the text to send is set.
4. A UDP connection is opened[^1].
5. Depending on the arguments, the screen is cleared.
6. All text snippets provided as an argument are concatenated with newlines in between. `--text "Hallo" --text "CCCB"` turns into `Hallo\nCCCB`.
7. The string is wrapped to the width of the display, resulting in a `CharGrid`, which is then immediately turned into a `CharGridCommand`. No fields are changed after this, so the text will be rendered in the top left of the screen when executed on the display.
8. The command is sent to the display.

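Steps 3 and 6 are plain standard library code and can be tried without the display. A small sketch, with `text_to_send` as a made-up helper that is not part of the example:

```rust
/// Step 3: fall back to a default text if nothing was given.
/// Step 6: join all snippets with newlines in between.
/// (Hypothetical helper for illustration, not part of `servicepoint`.)
fn text_to_send(mut snippets: Vec<String>) -> String {
    if snippets.is_empty() {
        snippets.push("Hello, CCCB!".to_string());
    }
    snippets.join("\n")
}

fn main() {
    // Corresponds to `--text "Hallo" --text "CCCB"` on the command line.
    let text = text_to_send(vec!["Hallo".to_string(), "CCCB".to_string()]);
    // Debug formatting makes the embedded newline visible as `\n`.
    println!("{text:?}");
}
```
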
At several steps, the program panics with an error message if something goes wrong.

I started with `rustc 1.82.0 (f6e511eec 2024-10-15)` from nixpkgs `0ff09db9d034a04acd4e8908820ba0b410d7a33a`.
For compiling the example, I just used the usual `cargo build --release --example announce` and checked the binary size with `ll -B target/release/examples`.

The resulting size was 1.1 MB, which should be easy enough to beat.

## Low hanging fruits

### Compiler options

The first thing that came to mind was telling the compiler to optimize for size, like with `gcc -Os`. The Rust equivalent is `opt-level = "s"`; for even more aggressive size optimization, `"z"` additionally disables loop vectorization.

| Option | Size in isolation (bytes) | Cumulative size (bytes) |
| --- | --- | --- |
| baseline | 1.137.384 | 1.137.384 |
| opt-level = 'z' | 1.186.104 | 1.186.104 |
| opt-level = 's' | 1.120.416 | 1.120.416 |
| lto = true | 914.496 | 808.528 |
| codegen-units = 1 | 982.904 | 775.888 |
| panic = 'abort' | 979.840 | 703.096 |
| strip = true | 915.944 | 580.056 |
| switching back to opt-level = 'z' | | 555.480 |

So it turns out that if you want to halve your binary size, a few flags are enough in stable Rust.
The most significant impacts came from link time optimization (LTO) and stripping of symbols from the binary.
Interestingly, different combinations of these settings didn't scale the way I would have intuitively thought.

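To put the cumulative column into perspective, each row can be expressed as a reduction relative to the baseline. The following sketch only uses the numbers from the table; the helper `reduction_percent` is made up for illustration. The last row works out to roughly 51% below the baseline, which is where the "halve" comes from.

```rust
/// Percentage by which `after` is smaller than `before`.
fn reduction_percent(before: u64, after: u64) -> f64 {
    (before as f64 - after as f64) / before as f64 * 100.0
}

fn main() {
    // Sizes in bytes, taken from the cumulative column of the table.
    let baseline = 1_137_384;
    let steps = [
        ("opt-level = 's'", 1_120_416_u64),
        ("lto = true", 808_528),
        ("codegen-units = 1", 775_888),
        ("panic = 'abort'", 703_096),
        ("strip = true", 580_056),
        ("opt-level = 'z'", 555_480),
    ];
    for (option, size) in steps {
        let saved = reduction_percent(baseline, size);
        println!("{option:<20} {size:>9} bytes  -{saved:.1}%");
    }
}
```
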
Apart from compilation time, the only compromise with these settings is the change in panic behavior, as this means no stack traces on crash[^panic-abort].

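What is lost with `panic = 'abort'` is unwinding itself, so anything that relies on catching a panic stops working. A minimal sketch using only the standard library (the helper `panicked` is hypothetical): built with the default unwinding strategy, `catch_unwind` turns the panic into an `Err`; built with `panic = 'abort'`, the process would terminate at the `panic!` instead of returning.

```rust
use std::panic;

/// Runs `f` and reports whether it panicked.
/// This relies on unwinding: with `panic = 'abort'`, the process would
/// terminate inside `f` and this function would never return.
fn panicked<F: FnOnce() + panic::UnwindSafe>(f: F) -> bool {
    panic::catch_unwind(f).is_err()
}

fn main() {
    // The panic message is still printed to stderr by the default panic hook.
    let failed = panicked(|| panic!("sending text failed"));
    println!("closure panicked: {failed}");
}
```
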
To only compile like this in specific scenarios, you can add a new profile to a crate's `Cargo.toml` like this:

```toml
[profile.size-optimized]
inherits = "release"
opt-level = 's'     # Optimize for size
lto = true          # Enable link-time optimization
codegen-units = 1   # Reduce number of codegen units to increase optimizations
panic = 'abort'     # Abort on panic
strip = true        # Strip symbols from binary
```

The profile can be used by passing `--profile=size-optimized` instead of `--release` to `cargo build`.
Because of the different profile, the binary ends up in a different folder (`ll -B target/size-optimized/examples` to check size).

### Features

Rust has a very handy way to represent variability in a library, called features.
The `servicepoint` library has the following declaration in its `Cargo.toml`:

```toml
[features]
default = ["compression_lzma", "protocol_udp", "cp437"]
compression_zlib = ["dep:flate2"]
compression_bzip2 = ["dep:bzip2"]
compression_lzma = ["dep:rust-lzma"]
compression_zstd = ["dep:zstd"]
all_compressions = ["compression_zlib", "compression_bzip2", "compression_lzma", "compression_zstd"]
rand = ["dep:rand"]
protocol_udp = []
protocol_websocket = ["dep:tungstenite"]
cp437 = ["dep:once_cell"]
```

The second line means that by default, cargo will enable LZMA compression, sending via UDP sockets and conversion between CP-437 and UTF-8.
Apart from `protocol_udp`, each of those features pulls in an optional dependency (which is why I made those features toggleable in the first place).
In the example, CP-437 and compression are not needed[^2], but UDP is obviously used.

Features can be toggled on the command line, which means the invocation can be changed to the following: `cargo build --example announce --profile=size-optimized --no-default-features --features=protocol_udp`[^3].
Doing that means less library code and fewer dependencies are pulled into the compilation process.

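Inside a library, features usually show up as conditional compilation. The sketch below illustrates the mechanism with feature names taken from the manifest above; the functions `default_compression` and `transport` are hypothetical and not the actual `servicepoint` code. Compiled on its own, without any features enabled, both functions take their fallback branch.

```rust
/// Picks the default compression depending on an enabled Cargo feature.
/// (Hypothetical function; only the feature name comes from the manifest.)
fn default_compression() -> &'static str {
    // `cfg!` expands to a literal `true` or `false` at compile time.
    if cfg!(feature = "compression_lzma") {
        "lzma"
    } else {
        "none"
    }
}

/// Only compiled when the `protocol_udp` feature is enabled.
#[cfg(feature = "protocol_udp")]
fn transport() -> &'static str {
    "udp"
}

/// Fallback that is compiled when the feature is disabled.
#[cfg(not(feature = "protocol_udp"))]
fn transport() -> &'static str {
    "none"
}

fn main() {
    println!("compression: {}, transport: {}", default_compression(), transport());
}
```
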
The result is a 555.480 byte binary, which is exactly the same size as without those flags.
This is not really surprising, as we enabled a bunch of compiler options that help remove whole sections of code that are not needed, especially link time optimization.
It is cool to see that the binary is identical, though.

In the rest of this post, I will omit those parameters, probably to the detriment of compilation time.

## Digging deeper

While this was a big improvement already, the binary was still 50 times the size of the C program.

_If halving it was this easy, can I do that a second time?_

Everything from here on required unstable features of the Rust toolchain, both because the tooling depends on them for more information about the program, and because the compiler options from here on are not (and maybe never will be) stabilized.

The version I ended up with was `rustc 1.88.0-nightly (5e17a2a91 2025-04-05)`.
In my environment, I had to call nightly cargo with `rustup run nightly cargo`, but that part is not included in the rest of the commands.
The executables I got with the nightly toolchain were already a bit smaller again (546.528 bytes).

The first thing I noticed was that I got some new warnings when compiling, all of which I fixed immediately. As they were mostly inside the documentation, I did not expect this to affect file size.

Next up, I added cargo-bloat to my flake. This tool can show you which functions take up the most space in your binary.
The invocation is similar to building: `cargo bloat --example announce --profile=size-optimized` resulted in the following output:

```
 File  .text     Size        Crate Name
 1.0%   5.5%  21.0KiB clap_builder clap_builder::parser::parser::Parser::get_matches_with
 0.9%   5.3%  20.5KiB          std std::backtrace_rs::symbolize::gimli::Cache::with_global
 0.6%   3.3%  12.6KiB          std std::backtrace_rs::symbolize::gimli::Context::new
 0.4%   2.4%   9.2KiB          std gimli::read::dwarf::Unit<R>::new
 0.4%   2.1%   7.9KiB          std addr2line::line::LazyLines::borrow
 0.3%   2.0%   7.5KiB     announce announce::main
 0.3%   1.8%   7.1KiB          std miniz_oxide::inflate::core::decompress
 0.3%   1.6%   6.3KiB          std addr2line::unit::ResUnit<R>::find_function_or_location::{{closure}}
 0.3%   1.5%   5.6KiB clap_builder clap_builder::builder::command::Command::_build_self
 0.2%   1.4%   5.3KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_templated_help
 0.2%   1.3%   5.1KiB clap_builder clap_builder::error::Error<F>::print
 0.2%   1.3%   4.9KiB clap_builder clap_builder::parser::parser::Parser::react
 0.2%   1.2%   4.8KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_args
 0.2%   1.2%   4.6KiB          std gimli::read::unit::parse_attribute
 0.2%   1.1%   4.4KiB          std addr2line::function::Function<R>::parse_children
 0.2%   1.0%   3.7KiB clap_builder clap_builder::output::help_template::HelpTemplate::write_subcommands
 0.2%   1.0%   3.7KiB clap_builder clap_builder::output::usage::Usage::write_arg_usage
 0.2%   1.0%   3.7KiB          std gimli::read::rnglists::RngListIter<R>::next
 0.1%   0.8%   3.1KiB          std std::backtrace_rs::symbolize::gimli::elf::<impl std::backtrace_rs::symbolize::gimli::Mapping>::new_debug
 0.1%   0.8%   3.0KiB clap_builder clap_builder::parser::parser::Parser::match_arg_error
10.8%  61.8% 237.3KiB              And 993 smaller methods. Use -n N to show more.
17.5% 100.0% 384.2KiB              .text section size, the file size is 2.1MiB
```

The biggest functions in the program are shown, starting with the largest.
From the table, we can already see some interesting stuff.

1. For some reason, the `.text` section (the machine code) is only a small part of the executable, and the total size increased by a factor of 4.
2. The biggest function and a bunch of other big ones are from `clap_builder`, a crate that is part of the command line argument parser.
3. `std` takes up most of the rest.
4. `main` is unexpectedly huge?
5. `servicepoint` does not even show up in the top list.

Let's cover those points in order.

### 1. Unexpected binary size when building via cargo-bloat

Using GNU `size`, we can check the size per section in the ELF binary.
The `-G` or `-B` output formats do not work for this, as they only show the `.text` and `.data` sections, which in this case only make up around 500KB.
Thus the command I used was `size -A --common target/size-optimized/examples/announce`, giving the following result:

```
section                 size     addr
.dynsym                 1680      856
.dynstr                 1198     3500
.rela.dyn              22800     4704
.gcc_except_table       3728    27552
.rodata                36592    31280
.eh_frame_hdr           8116    67872
.eh_frame              52488    75992
.text                 393449   132576
.data.rel.ro           18760   530248
.relro_padding          2888   550072
.data                   2400   554168
.debug_abbrev           1810        0
.debug_info           525404        0
.debug_aranges          6256        0
.debug_ranges         157856        0
.debug_str            726991        0
.debug_line           149936        0
Total                2115147

(I filtered out the rows <1KB for brevity)
```

It turns out `cargo-bloat` disables symbol stripping, because it needs the symbols to show function names to the user.
And it's not just the symbols that are included in release builds by default - _all_ the debugging information is included.
That means I can ignore that problem and focus on the `.text` size.

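To check how much of the file the debugging information accounts for, the `size -A` output can be summed up per section prefix. This is a small sketch, not part of the original workflow: `section_total` is a made-up helper, and `SIZE_OUTPUT` is an excerpt of the listing above.

```rust
/// Sums the sizes of all sections whose name starts with `prefix`,
/// given the textual output of `size -A`.
fn section_total(size_output: &str, prefix: &str) -> u64 {
    size_output
        .lines()
        .filter_map(|line| {
            let mut columns = line.split_whitespace();
            let name = columns.next()?;
            // The header line is skipped here because "size" is not a number.
            let size = columns.next()?.parse::<u64>().ok()?;
            name.starts_with(prefix).then_some(size)
        })
        .sum()
}

/// Excerpt of the `size -A` output shown above.
const SIZE_OUTPUT: &str = "\
section               size     addr
.rodata              36592    31280
.eh_frame            52488    75992
.text               393449   132576
.debug_abbrev         1810        0
.debug_info         525404        0
.debug_aranges        6256        0
.debug_ranges       157856        0
.debug_str          726991        0
.debug_line         149936        0
Total              2115147
";

fn main() {
    let debug = section_total(SIZE_OUTPUT, ".debug");
    let total = section_total(SIZE_OUTPUT, "Total");
    let share = debug as f64 / total as f64 * 100.0;
    println!("debug sections: {debug} of {total} bytes ({share:.0}%)");
}
```

By this count, the `.debug_*` sections add up to 1.568.253 bytes, about three quarters of the reported total. The remaining 546.894 bytes are in the same range as the 546.528 byte stripped binary from before, which fits the explanation above.
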
### 2. Removing clap

While clap is super handy, it looks like the code needed to parse a few simple arguments blows up the executable.
As the C program I was comparing against had all the parameters hard-coded, I just ripped out the dependency and hard-coded the values I needed.

The result is the first version of `tiny_announce`, a new example, as I did not want to change the existing one.
```rust
//! An example for how to send text to the display.

use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
    TILE_WIDTH,
};

/// example: `cargo run -- --text "Hallo" --text "CCCB"`
fn main() {
    let text = "Hello, CCCB!";

    let connection = UdpConnection::open("127.0.0.1:2342")
        .expect("could not connect to display");

    connection.send(ClearCommand).expect("sending clear failed");

    let command: CharGridCommand = CharGrid::wrap_str(TILE_WIDTH, &text).into();
    connection.send(command).expect("sending text failed");
}
```

The command to compile changed slightly because of the new name. `cargo build --example tiny_announce --profile=size-optimized && ll -B target/size-optimized/examples/tiny_announce` gave me the new binary size.
Drumroll... 324.624 bytes!
With argument parsing removed, we saved 40% of the remaining binary size.
This also makes `main` disappear from the list of biggest functions for now.

While removing a library you do not really need is also possible in stable Rust, I only noticed the problem thanks to tooling that is only available on nightly, so I am putting it into that category.

### 3. build-std

Looking at the biggest functions again (now `cargo bloat --example tiny_announce --profile=size-optimized`) showed that all the big functions left were from `std`.
Most of that looked like stack unwinding and debug data parsing, which is odd as we added `panic = 'abort'` in the compiler options section.

As it turns out, as an optimization for the development workflow, cargo does not recompile the standard library by default.
Instead, a prebuilt version included in the toolchain is used.
The compiler arguments for that are fixed. To change that and get more control over how the standard library is compiled and linked, we need the unstable option `-Zbuild-std` and have to list which sub-crates we want to build (which is pretty much all of them).
Because we also have `panic = "abort"` set, we also need to pass `-Zbuild-std-features="panic_immediate_abort"` so there is no compilation error.

`cargo build --example tiny_announce --profile=size-optimized -Zbuild-std="core,std,alloc,proc_macro,panic_abort" -Zbuild-std-features="panic_immediate_abort"`

This produces a binary that is now only 30.992 bytes!

### In-between find: to_socket_addrs

The remaining top 3 functions were:

```
 File  .text    Size         Crate Name
 4.4%  11.0%  2.0KiB           std <&T as std::net::socket_addr::ToSocketAddrs>::to_socket_addrs
 3.8%   9.4%  1.7KiB tiny_announce tiny_announce::main
 2.9%   7.3%  1.4KiB     [Unknown] main
```

`main` shows up again! But what is that? 4.4% used by `to_socket_addrs`?
We found the last string parsing code, this time in the standard library, which reads the IP and port from a string.
After changing it in the example, it still showed up, which brings me to the first and only time I actually changed the `servicepoint` library as a result of this saga.

```patch
- let socket = UdpSocket::bind("0.0.0.0:0")?;
+ let addr = SocketAddr::from(([0, 0, 0, 0], 0));
+ let socket = UdpSocket::bind(addr)?;
```

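The reason this helps is that `UdpSocket::bind` accepts anything implementing `ToSocketAddrs`. For a string, that means parsing (and possibly resolving a host name) at runtime, while a `SocketAddr` is used as is. A small standard-library-only sketch, with `wildcard_addr` as a made-up helper name, showing that both forms describe the same address:

```rust
use std::net::{SocketAddr, UdpSocket};

/// The wildcard address `0.0.0.0:0`, built from numbers instead of text.
/// (Hypothetical helper for illustration.)
fn wildcard_addr() -> SocketAddr {
    SocketAddr::from(([0, 0, 0, 0], 0))
}

fn main() -> std::io::Result<()> {
    // The string form needs parsing code at runtime, the numeric form does not.
    let parsed: SocketAddr = "0.0.0.0:0".parse().expect("valid address");
    assert_eq!(wildcard_addr(), parsed);

    // Port 0 lets the operating system pick a free port.
    let socket = UdpSocket::bind(wildcard_addr())?;
    println!("bound to {}", socket.local_addr()?);
    Ok(())
}
```
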
This seemed to remove other functions as well, as the size was down to 17.272 bytes, nearly halving the size _again_.
It is now smaller than this article as plain text markdown.

### 4. no_main

You'd think that `main` would now be the top function, but `Iter::next` is the biggest one for some reason.
Still, `[Unknown] main` and the actual `main` take up 10% of the remaining size according to `cargo bloat`.

We surely cannot reduce that, right? Wrong!
With `#![no_main]`, you can tell Rust to not add any initialization code.
This means the normal `fn main()` does not get used, and the linker complains about the missing function.
To fix this, the function can be converted to a C-style `main`.

I also removed some more code by initializing the `CharGrid` directly instead of wrapping a string, which saved 400 bytes.

```rust
#![no_main]

use servicepoint::{
    CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
};
use std::net::SocketAddr;

#[unsafe(no_mangle)]
pub extern "C" fn main(_argc: isize, _argv: *const *const u8) -> isize {
    // not parsing the address from str removes 3KB
    let addr = SocketAddr::from(([172, 23, 42, 29], 80));

    let connection = UdpConnection::open(addr).unwrap();
    connection.send(ClearCommand).unwrap(); // <--

    let grid = CharGrid::from_vec(5, vec!['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']).unwrap();
    connection.send(CharGridCommand::from(grid)).unwrap();
    0
}
```

This resulted in an 8.064 byte executable, finally beating both GCC and LLVM compiling the minimal C program (around 10KB).
If we were to remove the marked line and not clear the screen, we could drop it further to 7.696 bytes.

### Advanced compiler abuse

There are two things left to reach the absolute bottom without ripping out the standard library altogether.

In Rust, a function can ask the compiler to pass in the location it was called from as an implicit parameter (via `#[track_caller]`), which is how panic messages know where they came from.
With `-Zlocation-detail=none`, we instruct the Rust compiler to just not bother with that.

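The mechanism in question is the `#[track_caller]` attribute together with `std::panic::Location::caller()`; functions such as `Option::unwrap` and `Result::expect` use it to report where they were called. A minimal sketch (the function `call_site` is made up for illustration): with default settings this prints the file, line and column of the call in `main`, which is exactly the information `-Zlocation-detail=none` asks the compiler to leave out.

```rust
use std::panic::Location;

/// Returns the source location of whoever called this function.
/// `#[track_caller]` makes the compiler pass that location in implicitly.
#[track_caller]
fn call_site() -> &'static Location<'static> {
    Location::caller()
}

fn main() {
    let location = call_site();
    println!(
        "called from {}:{}:{}",
        location.file(),
        location.line(),
        location.column()
    );
}
```
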
`-Zfmt-debug=none` is similar but worse, because it changes all the derived `Debug` implementations to do nothing at all.
The change in behavior is not obvious in this example, but do this in an application that has logging and it will be horribly broken.

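For illustration, this is the kind of output that goes missing. The sketch below uses a made-up `Packet` struct; with default settings, the derived `Debug` implementation prints the struct name and its fields, which is what log lines using `{:?}` rely on. With `-Zfmt-debug=none`, the same `{:?}` would reportedly produce no output.

```rust
/// Made-up type for illustration; any `#[derive(Debug)]` type behaves the same.
#[derive(Debug)]
struct Packet {
    x: u8,
    y: u8,
}

/// Formats a packet the way a typical log statement would.
fn describe(packet: &Packet) -> String {
    format!("{packet:?}")
}

fn main() {
    let packet = Packet { x: 1, y: 2 };
    println!("sending {}", describe(&packet));
}
```
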
As the icing on the cake, those two options cannot be passed via `cargo` arguments, so we have to use the environment variable `RUSTFLAGS` to pass them through to `rustc`.

The final command to build the tiniest possible announce in all its glory:

```sh
RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none" \
    cargo build \
    --example tiny_announce \
    --profile=size-optimized \
    --no-default-features \
    --features=protocol_udp \
    -Zbuild-std="core,std,alloc,proc_macro,panic_abort" \
    -Zbuild-std-features="panic_immediate_abort"
```

All of this reduces the final binary size to 8.064 bytes.

## Conclusion

Through this journey, I've managed to reduce the size of the binary from an unwieldy 1.1 MB to an impressive 8 KB, without sacrificing all of the standard library.
For me, the most unexpected finding was the size of the `clap` code, though I learned dozens of things at every step about the intricacies of how `cargo build` produces your final binary.

There is no single option that is the solution in itself; it's a matter of experimenting with a combination of compiler flags, feature toggles, and code optimizations.
While extreme options can be great if you want to squeeze out the last bytes, it's probably not worth using those in a "normal" computer scenario.

The key takeaway is that optimizing binary size in Rust, while not always straightforward, is achievable with the right techniques.
It is certainly easier to create a big binary in Rust than in C, but calling Rust bloated is blaming the language a bit too much.

Stay tuned for part two, in which I will try to do something similar with a C version of the example, using the C bindings of the `servicepoint` crate.

[^1]: Yes, I know UDP does not have connections. Internally, this just opens a UDP socket.
[^panic-abort]: Technically, you can catch a panic while unwinding, and there may even be a weird performance argument for doing that.
[^2]: Some commands can be compressed, but the text ones (both CP-437 and UTF-8) cannot. Clear is a _very_ simple command that does not have any payload, so no compression there either. If a `BitmapCommand` were used instead, using `into()` on a `Bitmap` would have hidden the fact that the default compression is used in that case. The default compression in turn is either LZMA or no compression, depending on whether the LZMA feature is enabled.
[^3]: This works here because `announce` is an example inside of the library itself. As an actual dependent, you would specify this in your `Cargo.toml`.