Posted by Chris Wailes – Senior Software Engineer
The performance, safety, and developer productivity provided by Rust have led to its rapid adoption on the Android platform. Since slow build times are a concern when using Rust, particularly in a project as large as Android, we’ve worked to ship the fastest version of the Rust toolchain that we can. To do this we leverage multiple forms of profiling and optimization, as well as tuning C/C++, linker, and Rust flags. Much of what I’m about to describe is similar to the build process for the official releases of the Rust toolchain, but tailored to the specific needs of the Android codebase. I hope this post is generally informative and, if you are a maintainer of a Rust toolchain, may make your life easier.
While Android is certainly not unique in its need for a performant cross-compiling toolchain, this fact combined with the large number of daily Android build invocations means that we must carefully balance the tradeoffs between the time it takes to build the toolchain, the toolchain’s size, and the produced compiler’s performance.
Our Build Process
To be clear, the optimizations listed below are also present in the versions of Rust you can obtain via rustup. Aside from the provenance, the Android toolchain differs from the official releases in the set of enabled targets and the codebase used for profiling. All performance numbers listed below are the time it takes to build the Rust components of an Android image, and may not reflect the speedup seen on other codebases compiled with our toolchain.
Codegen Units (CGU1)
When Rust compiles a crate, it breaks it into some number of code generation units. Each independent chunk of code is generated and optimized concurrently and then re-combined later. This approach allows LLVM to process each codegen unit separately and improves compile time, but can decrease the performance of the generated code. Some of this performance can be recovered via Link-Time Optimization (LTO), but this still doesn’t guarantee the same performance as if the crate were compiled as a single codegen unit.
To expose as many opportunities for optimization as possible and to ensure reproducible builds, we add the -C codegen-units=1 option to the RUSTFLAGS environment variable. This reduces the size of the toolchain by ~5.5% while increasing its speed by ~1.8%.
Be aware that setting this option slows the build of the toolchain itself by ~2x (measured on our workstations).
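As a minimal sketch, enabling this in your own Rust build might look like the following. The -C codegen-units=1 flag is a real rustc option; how you thread it through to the compiler (here, via RUSTFLAGS before a cargo invocation) depends on your build system:

```shell
# Build with a single codegen unit for maximum optimization scope and
# reproducibility, at the cost of a slower, less parallel build.
export RUSTFLAGS="-C codegen-units=1"

# Any subsequent compiler invocation picks the flag up, e.g.:
#   cargo build --release
```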
GC Sections
Many projects, including the Rust toolchain, have functions, methods, or even entire namespaces that are not needed in certain contexts. The safest and easiest option is to leave these code objects in the final product. This increases code size and may decrease performance (due to caching and layout issues), but it should never produce a miscompiled or mislinked binary.
It is possible, however, to ask the linker to remove code objects that aren’t transitively referenced from the main() function using the --gc-sections linker argument. The linker can only operate on a per-section basis, so if anything in a section is referenced, the entire section must be retained. This is why it is also common to pass the -ffunction-sections and -fdata-sections options to the compiler or code generation backend. These ensure that each code object is given an independent section, thus allowing the linker’s garbage collection pass to collect objects individually.
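A minimal sketch of this combination for a C/C++ build (the flags are the real GCC/Clang and GNU ld options; the variable names and file names are illustrative):

```shell
# Place every function and data object in its own section...
CFLAGS="-ffunction-sections -fdata-sections"
# ...then let the linker discard any section that is never referenced.
LDFLAGS="-Wl,--gc-sections"

# A build would then use them as, e.g.:
#   cc $CFLAGS -c foo.c && cc $LDFLAGS foo.o -o foo
```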
This was one of the first optimizations we implemented, and at the time it produced significant size savings (on the order of 100s of MiB). However, most of these gains have been subsumed by those made from setting -C codegen-units=1 when the two are used in combination, and there is now no difference in size or performance between the two produced toolchains. That said, due to the extra overhead, we do not always use CGU1 when building the toolchain. When testing for correctness, the final speed of the compiler is less important, and as such we allow the toolchain to be built with the default number of codegen units. In these instances we still use section GC, gaining some of the performance and size benefits at a much lower cost.
Link-Time Optimization (LTO)
An optimizer can only optimize the functions and data it can see. Building a library or executable from independent object files or libraries can speed up compilation, but at the cost of optimizations that depend on information only available when the final binary is assembled. Link-Time Optimization (LTO) gives the compiler another opportunity to analyze and modify the binary at link time.
For the Android Rust toolchain we perform thin LTO on both the C++ code in LLVM and the Rust code that makes up the Rust compiler and tools. Because the IR emitted by our Clang may be a different version than the IR emitted by rustc, we can’t perform cross-language LTO or statically link against libLLVM. However, the performance gained by using an LTO-optimized shared library is greater than that of a non-LTO-optimized static library, so we’ve opted to use shared linking.
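As a rough sketch, enabling thin LTO alongside CGU1 in a generic Rust build could look like this. The rustc flags are real; the Clang-side note describes the analogous C/C++ option:

```shell
# Thin LTO for Rust code, combined with a single codegen unit.
export RUSTFLAGS="-C lto=thin -C codegen-units=1"

# For the C++ side of a mixed build, the analogous Clang/lld option is:
#   CFLAGS="-flto=thin" LDFLAGS="-flto=thin"
```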
Using CGU1, GC sections, and LTO produces a ~7.7% speedup and a ~5.4% size improvement over the baseline. This works out to a ~6% speedup over the previous stage in the pipeline due solely to LTO.
Profile-Guided Optimization (PGO)
Command line arguments, environment variables, and the contents of files can all influence how a program executes. Some blocks of code might be used frequently, while other branches and functions are only exercised when an error occurs. By profiling the application as it runs, we can collect data on how often these blocks of code are executed. This data can then be used to guide optimizations when recompiling the program.
We use instrumented binaries to collect profiles both from building the Rust toolchain itself and from building the Rust components of Android images for x86_64, aarch64, and riscv64. These four profiles are then combined, and the toolchain is compiled again using profile-guided optimizations.
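A generic PGO cycle for a Rust project follows the shape sketched below. The -C profile-generate / -C profile-use rustc flags and the llvm-profdata tool are real; the project layout, binary name, workload, and paths are illustrative assumptions. The function is defined here but not run:

```shell
# Hypothetical two-phase PGO build; paths and workload are placeholders.
pgo_build() {
  local profdir=/tmp/pgo-data
  # 1) Build an instrumented binary that writes profile data when run.
  RUSTFLAGS="-C profile-generate=$profdir" cargo build --release
  # 2) Exercise it with a representative workload.
  ./target/release/app --typical-workload
  # 3) Merge the raw profiles into a single file.
  llvm-profdata merge -o "$profdir/merged.profdata" "$profdir"
  # 4) Rebuild, letting the collected profile guide optimization.
  RUSTFLAGS="-C profile-use=$profdir/merged.profdata" cargo build --release
}
```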
As a result, the toolchain achieves a ~19.8% speedup and a ~5.3% size reduction compared to the baseline compiler. This is a 13.2% speedup over the previous stage in the pipeline.
BOLT: Binary Optimization and Layout Tool
Even with LTO enabled, the linker is still in control of the final binary’s layout. Because it isn’t guided by any profiling information, the linker might place a frequently called (hot) function next to an infrequently called (cold) function. When the hot function is later called, all functions on the same memory page are loaded. The cold functions now take up space that could be allocated to other hot functions, forcing additional pages containing those hot functions to be loaded.
BOLT mitigates this problem by using an additional set of layout-focused profiling information to re-organize functions and data. For the purposes of speeding up rustc, we profiled libLLVM, libstd, and librustc_driver, the libraries that make up the bulk of the compiler. These libraries are then BOLT-optimized using the collected profiles.
Any additional libraries matching lib/*.so are optimized without profiles using only the --peepholes=all option.
Applying BOLT to our toolchain produces a ~24.7% speedup over the baseline, at the cost of a ~10.9% size increase. This is a ~6.1% speedup over the PGO-optimized compiler without BOLT.
If you’re interested in using BOLT in your own project/build, I’d offer these two tips: 1) you’ll need to emit additional relocation information into your binaries using the -Wl,--emit-relocs linker argument, and 2) use the same input library when generating the instrumented and the optimized versions.
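Putting those tips together, a generic BOLT workflow might look like the sketch below. llvm-bolt and merge-fdata are real LLVM tools, and the reordering/splitting flags are real BOLT options, but the library name, profile paths, and the choice of options here are illustrative. The function is defined but not executed:

```shell
# Hypothetical BOLT instrument/optimize cycle for a shared library.
bolt_optimize() {
  local lib=libexample.so
  # The library must have been linked with relocations preserved,
  # i.e. with -Wl,--emit-relocs.
  # 1) Produce an instrumented copy that records layout profiles.
  llvm-bolt "$lib" -instrument -o "${lib}.inst"
  # 2) Run a representative workload against the instrumented copy
  #    (producing .fdata profile files), then merge the profiles.
  merge-fdata /tmp/prof.*.fdata > merged.fdata
  # 3) Optimize the ORIGINAL library (same input as step 1) with profiles.
  llvm-bolt "$lib" -o "${lib}.bolt" --data merged.fdata \
    --reorder-blocks=ext-tsp --reorder-functions=hfsort --split-functions
}
```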
By compiling as a single code generation unit, garbage collecting our code and data objects, performing both link-time and profile-guided optimizations, and leveraging the BOLT tool, we were able to speed up the time it takes to compile the Rust components of Android by 24.8%. For every 50k Android builds per day run in our CI infrastructure, we save ~10k hours of serial execution.
Our industry is not one to stand still, and there will surely be another tool and another set of profiles in need of collection in the near future. Until then, we’ll continue profiling and optimizing in search of further performance. Happy coding!