wip rust tiny binaries

Vinzenz Schroeter 2025-04-09 22:21:41 +02:00
parent caadf1fb58
commit d5e0d8c84e


@ -5,6 +5,8 @@ title = 'Debloating your rust binary'
tags = ['rust', 'servicepoint']
+++
<!-- TODO: is it rust or Rust -->
In [CCC Berlin](https://berlin.ccc.de/), there is a big pixel matrix hanging on the wall that we call "ServicePoint display".
Anyone in the local network can send UDP packets containing commands that the display will execute.
The commands are sent in a binary data structure and contain things like very basic text rendering and overwriting parts of the pixel buffer.
@ -18,10 +20,12 @@ Thus, I was immediatedly nerd-sniped and I could not think about anything else i
I _had_ to find out why it was so big, and there would _have_ to be a way to fix it.
This is part one, where I optimize the core library for size.
I have reordered the steps I tried for a better text structure, but the results were re-created in the order they appear here, using the stated tools.
In a future post, I also want to document how I got the C bindings smaller, as those use all features by default.
There are probably also some additional challenges worth facing, like the ABI for shared libraries.
Most of the techniques I used are described in [Minimizing Rust Binary Size](https://github.com/johnthagen/min-sized-rust), though I hope the specific example I provide makes the topic interesting to readers who do not write Rust code.
Let's get hacking!
## Starting point
@ -42,22 +46,14 @@ use servicepoint::{
/// [2]
#[derive(Parser, Debug)]
struct Cli {
#[arg(short, long, default_value = "localhost:2342",
help = "Address of the display")]
destination: String,
#[arg(short, long, num_args = 1.., value_delimiter = '\n',
help = "Text to send - specify multiple times for multiple lines")]
text: Vec<String>,
#[arg(short, long, default_value_t = true,
help = "Clear screen before sending text")]
clear: bool,
}
@ -104,6 +100,8 @@ The resulting size was 1.1 MB, which should be easy enough to beat.
## Low-hanging fruit
### Compiler options
The first thing that came to mind was `-Os`, i.e. optimizing for binary size. The Rust equivalent is `opt-level = "s"`, or `"z"` to additionally disable loop vectorization.
| Option | size in isolation (change) | size cumulative (change) |
@ -118,6 +116,7 @@ The first thing that came to mind was `-Os`, so compiling for binary size. The r
| switching back to opt-level = 'z' | | 555.480 |
So it turns out that a few flags in stable Rust are enough to halve your binary size. Also, those settings do not combine linearly: sometimes an option that had resulted in a smaller binary before now increased the size in combination with the others.
The only compromise apart from compilation time is the change in panic behavior: a panic now aborts the process immediately instead of unwinding the stack, so destructors do not run and the panic cannot be caught[^panic-abort].
To only compile like this in specific scenarios, you can add a new profile to a crate's `Cargo.toml` like this:
@ -134,13 +133,45 @@ strip = true # Strip symbols from binary
The profile can be used by passing `--profile=size-optimized` instead of `--release` to `cargo build`.
Because of the different profile, the binary ends up in a different folder (`ll -B target/size-optimized/examples` to check size).
### Features
Rust has a very handy way to represent variability in a library: features.
The `servicepoint` library has the following declaration in its `Cargo.toml`:
```toml
[features]
default = ["compression_lzma", "protocol_udp", "cp437"]
compression_zlib = ["dep:flate2"]
compression_bzip2 = ["dep:bzip2"]
compression_lzma = ["dep:rust-lzma"]
compression_zstd = ["dep:zstd"]
all_compressions = ["compression_zlib", "compression_bzip2", "compression_lzma", "compression_zstd"]
rand = ["dep:rand"]
protocol_udp = []
protocol_websocket = ["dep:tungstenite"]
cp437 = ["dep:once_cell"]
```
The `default` line means that, by default, Cargo enables LZMA compression, sending via UDP sockets, and conversion between CP-437 and UTF-8.
Apart from `protocol_udp`, each of those features pulls in an optional dependency (which is why I made those features toggleable in the first place).
In the code, CP-437 and compression are not needed[^2], but UDP is obviously used.
Features can be toggled on the command line, which means the invocation can be changed to the following: `cargo build --example announce --profile=size-optimized --no-default-features --features=protocol_udp`[^3].
The result is a 555.480-byte binary, which is exactly the same as without those flags.
This is not really surprising, as the compiler options we enabled earlier, especially link-time optimization, already remove whole sections of code that are not needed.
In the rest of this post, I will omit those parameters, probably to the detriment of compilation time.
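For readers who have not used features before, here is a minimal sketch of how a library can gate code behind a feature. The function `compress` and its bodies are hypothetical and not taken from `servicepoint`; only the feature name mirrors the `Cargo.toml` above. When the feature is disabled, the gated variant (and with it any call into the optional dependency) is simply not compiled.

```rust
/// Hypothetical helper, only compiled when the `compression_lzma` feature is enabled.
/// A real implementation would call into the optional LZMA dependency here.
#[cfg(feature = "compression_lzma")]
pub fn compress(payload: &[u8]) -> Vec<u8> {
    unimplemented!("would call the optional LZMA crate on {} bytes", payload.len())
}

/// Fallback compiled when the feature is disabled: the payload is passed through
/// unchanged, so no code from the optional dependency ends up in the binary.
#[cfg(not(feature = "compression_lzma"))]
pub fn compress(payload: &[u8]) -> Vec<u8> {
    payload.to_vec()
}
```

Both variants keep the same signature, so callers compile unchanged no matter which features are selected.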
## Digging deeper
While this was a big improvement already, this was still 50 times the size of the C program.
_If it was this easy halving it, can I do that a second time?_
Everything from here on required unstable features, for which I used the Rust toolchain from the [flake for RedoxOS development](https://gitlab.redox-os.org/redox-os/redox/-/blob/cb34b9bd862f46729c0082c37a41782a3b1319c3/flake.nix#L38). The version I ended up with was `rustc 1.88.0-nightly (5e17a2a91 2025-04-05)`.
In my environment, I had to call nightly cargo with `rustup run nightly cargo`, but that part is not included in the rest of the commands.
The executables I got with the unstable version were already a bit smaller again (546.528 bytes).
The first thing I noticed was that I got some new warnings when compiling, all of which I fixed immediately. As they were mostly in the documentation, I did not expect this to affect the file size.
@ -173,6 +204,193 @@ File .text Size Crate Name
17.5% 100.0% 384.2KiB .text section size, the file size is 2.1MiB
```
Wait, what? Why is the binary 2.1 MB now (checked with `ll -B target/size-optimized/examples`)?
From the table, we can already see some interesting stuff.
1. For some reason, the `.text` section (the machine code) is only a small part of the executable, and the total size increased by a factor of 4.
2. The biggest function and a bunch of other big ones are from `clap_builder`, a crate that is part of the command line argument parser.
3. `std` takes up most of the rest.
4.
5. `servicepoint` does not even show up in the top list.
Let's cover those points in order.
### 1. Unexpected binary size when building via cargo-bloat
Using GNU `size`, we can check the size of each section in the ELF binary.
The `-G` and `-B` output formats do not work for this, as they only show the totals for text and data, which in this case make up only around 500 KB.
Thus, the command I used was `size -A --common target/size-optimized/examples/announce`, giving the following result:
```
section size addr
.dynsym 1680 856
.dynstr 1198 3500
.rela.dyn 22800 4704
.gcc_except_table 3728 27552
.rodata 36592 31280
.eh_frame_hdr 8116 67872
.eh_frame 52488 75992
.text 393449 132576
.data.rel.ro 18760 530248
.relro_padding 2888 550072
.data 2400 554168
.debug_abbrev 1810 0
.debug_info 525404 0
.debug_aranges 6256 0
.debug_ranges 157856 0
.debug_str 726991 0
.debug_line 149936 0
Total 2115147
(I filtered out the rows <1KB for brevity)
```
It turns out `cargo-bloat` disables symbol stripping, because it needs the symbols to show you which function and crate the code belongs to.
And it is not just the symbols that release builds include by default: _all_ the debugging information is included.
That means I can ignore that problem and focus on the `.text` size.
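To double-check that the debug sections explain the jump, a small sketch can add up the rows of the `size -A` output above. The rows under 1 KB were filtered out there, so the sums fall slightly short of the total. `SECTIONS` and `sum_sections` are just illustrative names.

```rust
/// Section sizes in bytes, copied from the `size -A` output above
/// (rows smaller than 1 KB were filtered out there and are missing here).
pub const SECTIONS: &[(&str, u64)] = &[
    (".dynsym", 1680),
    (".dynstr", 1198),
    (".rela.dyn", 22800),
    (".gcc_except_table", 3728),
    (".rodata", 36592),
    (".eh_frame_hdr", 8116),
    (".eh_frame", 52488),
    (".text", 393449),
    (".data.rel.ro", 18760),
    (".relro_padding", 2888),
    (".data", 2400),
    (".debug_abbrev", 1810),
    (".debug_info", 525404),
    (".debug_aranges", 6256),
    (".debug_ranges", 157856),
    (".debug_str", 726991),
    (".debug_line", 149936),
];

/// Sums the sizes of all sections whose name satisfies `predicate`.
pub fn sum_sections(sections: &[(&str, u64)], predicate: impl Fn(&str) -> bool) -> u64 {
    sections
        .iter()
        .filter(|entry| predicate(entry.0))
        .map(|entry| entry.1)
        .sum()
}
```

The `.debug_*` sections add up to 1.568.253 bytes, roughly three quarters of the 2.115.147-byte file. The remaining listed sections come to 544.099 bytes, close to the 546.528 bytes of the stripped nightly build.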
### 2. Removing clap
While clap is super handy, it looks like the code needed to parse a few simple arguments blows up the executable.
That is probably a mix of complex parsing logic with error handling, constant strings, and data-dependent code paths that the compiler cannot prove to be unused.
As the C program I was comparing against had all the parameters hard-coded, I just ripped out the dependency and hard-coded the values I needed.
The result is the first version of `tiny_announce`, a new example, as I did not want to change the existing one.
```rust
//! An example for how to send text to the display.
use servicepoint::{
CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
TILE_WIDTH,
};
/// example: `cargo run --example tiny_announce` (the text and address are hard-coded)
fn main() {
let text = "Hello, CCCB!";
let connection = UdpConnection::open("127.0.0.1:2342")
.expect("could not connect to display");
connection.send(ClearCommand).expect("sending clear failed");
let command: CharGridCommand = CharGrid::wrap_str(TILE_WIDTH, &text).into();
connection.send(command).expect("sending text failed");
}
```
The command to compile changed slightly because of the new name. `cargo build --example tiny_announce --profile=size-optimized && ll -B target/size-optimized/examples/tiny_announce` gave me the new binary size.
Drumroll... 324.624 Bytes!
40% of the binary was argument parsing.
This also makes the main disappear from the top sized functions.
Removing a library you do not really need also works in stable Rust, but I only noticed it with tooling that is only available on nightly, so I am putting it into that category.
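If you do need a couple of arguments without clap, one option is a hand-rolled loop over `std::env::args()`. The following is a minimal sketch whose size impact I did not measure in this post. `Args` and `parse_args` are hypothetical names, only `--destination` and repeated `--text` are supported, and the `localhost:2342` default mirrors the clap version above. The `format!` in the error path will still pull in some formatting code.

```rust
/// Hypothetical replacement for the clap `Cli` struct (without the `clear` flag).
#[derive(Debug, PartialEq)]
pub struct Args {
    pub destination: String,
    pub text: Vec<String>,
}

/// Parses `--destination <addr>` and repeated `--text <line>` (short forms `-d`/`-t`).
/// Expects the arguments without the program name.
pub fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Args, String> {
    let mut destination = String::from("localhost:2342");
    let mut text = Vec::new();
    let mut iter = args.into_iter();
    while let Some(arg) = iter.next() {
        match arg.as_str() {
            "-d" | "--destination" => {
                // `?` converts the `&str` error into `String`
                destination = iter.next().ok_or("missing value for --destination")?;
            }
            "-t" | "--text" => {
                text.push(iter.next().ok_or("missing value for --text")?);
            }
            other => return Err(format!("unknown argument: {other}")),
        }
    }
    Ok(Args { destination, text })
}
```

In `main`, this would be called as `parse_args(std::env::args().skip(1))`, skipping the program name.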
### 3. build-std
Looking at the biggest functions again (now `cargo bloat --example tiny_announce --profile=size-optimized`) showed that all the big functions left were from `std`.
Most of that looked like stack unwinding and debug data parsing, which is odd as we added `panic = 'abort'` in the first chapter.
As it turns out, Cargo does not recompile the standard library by default, as an optimization for the development workflow.
Instead, a prebuilt version included in the toolchain is used.
This works because that prebuilt `std` was compiled with exactly the same compiler version as the user's program; otherwise, Rust's lack of a stable ABI would come into play.
The compiler arguments used for it are fixed, though. To change them, we need the unstable option `-Zbuild-std` and have to list which sub-crates we want to build (which is pretty much all of them).
Because we also have `panic = "abort"` set, we need to also pass in `-Zbuild-std-features="panic_immediate_abort"` so there is no compilation error.
`cargo build --example tiny_announce --profile=size-optimized -Zbuild-std="core,std,alloc,proc_macro,panic_abort" -Zbuild-std-features="panic_immediate_abort"`
This produces a binary that is now only 30.992 bytes!
### to_socket_addrs
The remaining top 3 functions were:
```
File .text Size Crate Name
4.4% 11.0% 2.0KiB std <&T as std::net::socket_addr::ToSocketAddrs>::to_socket_addrs
3.8% 9.4% 1.7KiB tiny_announce tiny_announce::main
2.9% 7.3% 1.4KiB [Unknown] main
```
Finally our code shows up again! But what is that? 4.4% used by `to_socket_addrs`?
We found the last piece of string parsing code, this time in the standard library, reading the IP and port from a string.
After I changed it in the example, it still showed up, which brings me to the first and only time I actually changed the `servicepoint` library as a result of this saga.
```patch
- let socket = UdpSocket::bind("0.0.0.0:0")?;
+ let addr = SocketAddr::from(([0, 0, 0, 0], 0));
+ let socket = UdpSocket::bind(addr)?;
```
This seemed to remove other functions as well, as the size was down to 17.272 bytes, nearly halving it _again_.
It is now smaller than this article as plain text markdown.
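The two lines of the patch can be compared in isolation. According to the `ToSocketAddrs` documentation, a string may be either a literal socket address or a `<host_name>:<port>` pair that gets resolved. That means the string-based path probably carries name-resolution code along, even when the string is a plain IP. Building the `SocketAddr` from its parts avoids that. The function names below are illustrative.

```rust
use std::net::{Ipv4Addr, SocketAddr, SocketAddrV4, ToSocketAddrs};

/// Old approach: get the address from a string. `<str as ToSocketAddrs>`
/// accepts literal addresses as well as host names that need resolving.
pub fn addr_from_str() -> SocketAddr {
    "0.0.0.0:0"
        .to_socket_addrs()
        .expect("address should be valid")
        .next()
        .expect("at least one address")
}

/// New approach from the patch: build the address from an IP and a port.
pub fn addr_from_parts() -> SocketAddr {
    SocketAddr::from(([0, 0, 0, 0], 0))
}

/// Equivalent, more explicit spelling using the unspecified IPv4 address.
pub fn addr_unspecified() -> SocketAddr {
    SocketAddr::V4(SocketAddrV4::new(Ipv4Addr::UNSPECIFIED, 0))
}
```

Either of the last two can be passed to `UdpSocket::bind`, since `SocketAddr` implements `ToSocketAddrs` as well.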
### no_main
You'd think that `main` is now the top function, but for some reason `Iter::next` is the biggest one.
Still, `[Unknown] main` and the actual main take up 10% of the remaining size according to `cargo bloat`.
We surely cannot reduce that, right? Wrong!
With `#![no_main]`, you can tell Rust not to add its usual initialization code around `main`.
This means the normal `fn main()` does not get used, and the linker complains about the missing function.
To fix this, the function can be converted to a C-style main.
I also removed some more code by initializing the CharGrid directly instead of wrapping a string, which saved 400 bytes.
```rust
#![no_main]
use servicepoint::{
CharGrid, CharGridCommand, ClearCommand, Connection, UdpConnection,
};
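use std::ffi::{c_char, c_int};
use std::net::SocketAddr;
// C expects `int main(int argc, char **argv)`, so use the matching C types
#[unsafe(no_mangle)]
pub extern "C" fn main(_argc: c_int, _argv: *const *const c_char) -> c_int {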
// not parsing the address from str removes 3KB
let addr = SocketAddr::from(([172, 23, 42, 29], 80));
let connection = UdpConnection::open(addr).unwrap();
connection.send(ClearCommand).unwrap(); // <--
let grid = CharGrid::from_vec(5, vec!['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']).unwrap();
connection.send(CharGridCommand::from(grid)).unwrap();
0
}
```
This resulted in an 8.064-byte executable, finally beating the minimal C program compiled with both GCC and LLVM (around 10 KB).
If we were to remove the marked line and not clear the screen, we could drop it further to 7.696 bytes.
### Advanced compiler abuse
There are two things left to reach the absolute bottom without ripping out the standard library altogether.
In Rust, a function marked with `#[track_caller]` receives the location it was called from as an implicit parameter.
With `-Zlocation-detail=none`, we instruct the compiler not to bother filling in those details.
`-Zfmt-debug=none` is similar but worse, because it changes all the default `Debug` implementations to do nothing at all.
The change in behavior is not obvious in this example, but do this in an application that has logging and it will be horribly broken.
As the icing on the cake, those two options cannot be passed as `cargo` arguments, so we have to use the `RUSTFLAGS` environment variable to pass them through to `rustc`.
The final command to build the tiniest possible announce in all its glory:
```sh
RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none" \
cargo build \
--example tiny_announce \
--profile=size-optimized \
--no-default-features \
--features=protocol_udp \
-Zbuild-std="core,std,alloc,proc_macro,panic_abort" \
-Zbuild-std-features="panic_immediate_abort"
```
All of this reduces the final binary size to 7.696 bytes.
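To make the last two flags more concrete, here is a small sketch of what they affect, assuming a normal build without them. `#[track_caller]` is the attribute behind the implicit caller location (it is how `unwrap` reports where a panic happened), and a derived `Debug` implementation is what `-Zfmt-debug=none` turns into a no-op. The names `caller_location` and `Point` are made up for illustration. As far as I understand the unstable book, `-Zlocation-detail=none` replaces the file name with a placeholder and line and column with 0.

```rust
use std::panic::Location;

/// With `#[track_caller]`, `Location::caller()` returns where this function
/// was called from instead of the line inside it.
#[track_caller]
pub fn caller_location() -> &'static Location<'static> {
    Location::caller()
}

/// A derived `Debug` implementation: in a normal build, `{:?}` prints the
/// type name and all fields.
#[derive(Debug)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}
```

Building the same code with `RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none"` on nightly should change both results, which is exactly the kind of breakage for logging described above.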
## Conclusion
<!-- TODO -->
[^1]: Yes, I know UDP does not have connections. Internally, this just opens a UDP socket.
[^panic-abort]: Technically, you can catch a panic while unwinding and there may even be a weird performance argument for doing that, see <!-- TODO find article about making serde faster with panic catching -->
[^2]: Some commands can be compressed, but the text ones (both CP-437 and UTF-8) cannot. Clear is a _very_ simple command that does not have any payload, so no compression there either. If a `BitmapCommand` was used instead, using `into()` on a `Bitmap` would have hidden the fact that the default compression is used in that case. The default compression in turn is either LZMA or no compression, depending on whether the LZMA feature is enabled.
[^3]: This works here because `announce` is an example inside of the library itself. As an actual dependent, you would specify this in your `Cargo.toml`.