Growing your lab

Assumptions

At this point, it is expected that you will have a simple lab setup with some virtual devices and some simple test boards, each regularly passing health checks. It is generally a good idea to allow time for your lab to settle. Run a number of test jobs and understand the administrative burden before trying to expand your setup.

Once you are happy that things are working and you know how to run that simple lab, here are suggested steps to follow to grow it.

Requirements

When planning your lab, keep in mind the basic requirements of automation and LAVA:

  • Instances need high reliability.
  • Devices under test may often be prototypes or developer editions of hardware. This can lead to reliability and stability issues.
  • Automation may need extra hardware which is not commonly found in generic hosting locations like data centres.
  • Depending on the type of hardware in use, it may have significant problems with automation. Some devices may need security measures to be disabled, for example.
  • Test jobs need to run quickly to provide useful results to developers.
  • Lab infrastructure needs to remain secure (some resources may not be open to the public).

Logical layouts

The simplest LAVA instance is a single master with a single worker on the same machine. Adding more devices to such an instance may quickly cause problems with load: test jobs may time out on downloads or decompression, and devices will go offline.

The first step in growing a LAVA lab is to add a remote worker. Remote workers can be added to any V2 master. To do so, use the Django administration interface to add new devices and device types, allocate some or all devices to the newly created remote worker and, optionally, configure encryption on the master and on the worker for secure communication between them.
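
As a sketch of the command-line equivalent (everything here can also be done through the Django administration interface), the following might be run on the master once the worker has lava-dispatcher installed and pointed at it. The worker and device names are hypothetical, and the exact lava-server manage subcommands and options differ between releases, so check lava-server manage help first:

    # Register the new remote worker with the master.
    lava-server manage workers add remote-worker01

    # Create a device type and a device, and allocate the device to the
    # new worker. "beaglebone-black" and "bbb-01" are example names only.
    lava-server manage device-types add beaglebone-black
    lava-server manage devices add --device-type beaglebone-black \
        --worker remote-worker01 bbb-01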

As load increases, the master will typically benefit from having fewer and fewer devices directly attached to the worker running on the master machine. Complex labs will typically only have devices attached to remote workers.

Depending on the workload and admin preferences, there are several lab layouts that can make sense:

Single master, single worker

[Diagram: single master, single worker]

This is the starting layout for a fresh installation. Depending on the capability of the master, this layout can support a small variety of devices and a small number of users. This layout does not scale well. Adding too many devices or users to this setup can lead to the highest overall maintenance burden, per test job, of all the layouts here.

In all of these example diagrams, Infrastructure represents the extra equipment that might be used alongside the LAVA master and workers, such as mirrors, caching proxies etc.

Single master, multiple workers

[Diagram: single master, multiple workers]

A medium to large lab can operate well with a single master controlling multiple workers, especially if the master is a dedicated server running only lava-server.

Multiple masters, multiple workers

[Diagram: multiple masters, multiple workers]

A custom frontend can use custom result handling to aggregate data from multiple separate masters into a single data set. The different masters can be geographically separated and run by different admins. This is the system used to great effect by KernelCI.org.

Micro-instances

[Diagram: micro-instance layout]

When different teams need different sets of device types and configurations and where there is little overlap between the result sets for each team, a micro-instance layout may make sense.

The original single lab is split into separate networks, each with a separate complete instance of a LAVA master and one or more workers. This will give each team their own dedicated micro-instance, but the administrators of the lab can use common infrastructure just like a single lab in a single location. Each micro-instance can be grown in a similar way to any other instance, by adding more devices and more workers.

Which layout is best?

The optimum configuration depends heavily on the devices and test jobs that you expect to run. Use the multiple masters, multiple workers option where all test jobs feed into a single data set. Use micro-instances where teams have discrete sets of results. Any combination of micro-instances can still be aggregated behind one or more custom frontends to give different overviews of the results.

As an example, the Linaro LAVA lab in Cambridge is a hybrid setup. It operates using a set of micro-instances, some of which provide results to frontends like KernelCI.org.

Recommendations

  • Some labs have found it beneficial to have identical machines serving as the workers, in identical racks. This makes administration of a large lab much easier. It can also be beneficial to take this one stage further and have a similar, if not identical, set of devices on each worker. If your lab has a wide range of test job submissions which cover most device types, you may find that a similar layout helps balance the load.

  • Consider local mirroring or caching of resources such as NFS rootfs tarballs, kernel images, compressed images and git repositories. It is valuable to make downloads to the worker as quick as possible: slow downloads will inflate the run time of every test. (A caching proxy sketch follows this list.)

    • One of the administrative problems of CI is that these images change frequently, so a caching proxy may be more effective than a direct mirror of the build system storage.

    • Conversely, the use of https:// URLs inside test jobs will typically make caches and proxies much less effective. Not supporting https:// access to git repositories or build system storage can have implications for the physical layout of the lab, depending on local policy.

    • Depending on the lab, local mirroring of one or more distribution package archives can also be useful.

      Note

      This may rely on the build system being configured so that NFS rootfs and other deployment images always use the local mirror. That can then have implications for test writers trying to debug failed test jobs without access to the mirror.

  • Consider the implications of persistence. LAVA does not (currently) archive old test jobs, log files or results. The longer a single master is collating the results from multiple workers, the larger the dataset on that master becomes. This can have implications for the time required to perform backups, extract results or run database migrations during upgrades. (A backup sketch follows this list.)

  • Consider reliability concerns: each site should have UPS support. Some sites may need generators as well. This is not just needed for the master and workers: it will also be required for all the devices, the network switches and all your other lab infrastructure.

  • Devices in LAVA always need to remain in a state which can be automated. This may add lots of extra requirements: custom hardware, extra cabling and other support devices not commonly found in general hosting locations. This also means that LAVA is not suitable for customer-facing testing, debugging or triage.
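
As an illustration of the caching recommendation above, a proxy such as Squid can be tuned to hold large build artifacts. This is a minimal sketch, assuming a Debian-based host; the cache sizes, paths and file patterns are illustrative only and will need tuning for your lab:

    # Install Squid; the default configuration only caches small objects.
    apt install squid

    # /etc/squid/conf.d/lava-cache.conf
    # Allow individual cached objects of up to 4 GB (rootfs tarballs etc.)
    maximum_object_size 4 GB
    # Use a 200 GB on-disk cache.
    cache_dir ufs /var/spool/squid 200000 16 256
    # Keep common build artifacts for up to a week, even when the storage
    # server does not send cache-friendly headers.
    refresh_pattern -i \.(img|ext4|tar\.gz|tar\.xz)$ 10080 90% 10080 override-expire

    # Pick up the new configuration.
    systemctl restart squid

Workers then need to be pointed at the proxy, for example by exporting http_proxy into the dispatcher environment; how that environment is set varies between LAVA releases, so check the documentation for your version.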
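
For the persistence recommendation, the growing dataset lives in the PostgreSQL database on the master. This is a minimal backup sketch; it assumes the default database name of lavaserver (the actual name is recorded in /etc/lava-server/instance.conf):

    # Dump the LAVA database. On a busy master with a long job history
    # this can take a long time and produce a very large file, which is
    # exactly the scaling concern described above.
    sudo -u postgres pg_dump --format=custom \
        --file=/var/tmp/lavaserver-$(date +%Y%m%d).dump lavaserver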

Physical layouts

Important

If the master and one or more of the workers are to be connected across the internet instead of within a locally managed subnet, encryption on the master and on all workers is strongly recommended.

LAVA V2 supports geographically separate masters and workers. Workers can be protected behind a firewall or even a NAT internet connection, without the need for dynamic DNS or other such services. Connections are made from the worker to the master, so the only requirement is that the ZMQ ports configured on the master are reachable from the internet; because those ports are exposed, they should use encryption.
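
As a sketch, that means opening the ZMQ ports on the master's firewall and generating certificates. The port numbers below are the usual defaults (5556 for master/worker control, 5555 for log messages) but should be checked against your lava-master and lava-logs configuration, and the certificate helper path is an assumption that has varied between releases:

    # Allow the default ZMQ ports through the master's firewall
    # (adjust if you have changed them in the LAVA configuration).
    ufw allow 5556/tcp   # master <-> worker control messages
    ufw allow 5555/tcp   # worker -> master log messages

    # Generate ZMQ curve certificates on the master and on each worker,
    # then exchange the public halves. The helper script shipped for this
    # has moved between releases; treat this path as an example and
    # consult the documentation for your version.
    /usr/share/lava-dispatcher/create_certificate.py master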

Physically separating different workers is also possible but has implications:

  • Resources need to be mirrored, cached or proxied to multiple locations.
  • The administrative burden of a LAVA lab is frequently centred on the devices themselves, which often require a range of support tasks unsuitable for generic hosting locations. It is common for a trained admin to need physical access to test device hardware to fix problems. The latency involved in getting someone to the location of the device to change a microSD card, press buttons on a problematic device, investigate PDU failures and perform other admin tasks will have a large impact on the performance of the LAVA lab itself.
  • Physical separation across different sites can mean that test writers may see varying performance according to which worker has idle devices at the time. If one worker has a slower connection to the build system storage, test writers will need to allow for this in the job submission timeouts, possibly causing jobs on faster workers to spend longer waiting for the timeout to expire.
  • Each location still needs UPS support, backup support and other common lab infrastructure as laid out previously.

Resources

The Linaro lab in Cambridge has provided most of the real-world experience used to construct this guide. If you are looking for guidance about how to grow your lab, please talk to us on the lava-devel mailing list.