Peer-to-peer updates for edge compute nodes

Low-cost, low-power edge compute devices and nodes are key components of Internet of Things (IoT) systems that are embedded in smart homes and smart cities. These systems generally start small but can rapidly scale to many thousands of nodes. Devices can be inaccessible, mobile, or in private residential locations, so remote administration is essential to deploy updates and install new applications. This raises the question: how can we effectively manage and update such devices?

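There are a number of DevOps tools available, all of which tend to follow one of two patterns: a push model, in which a management server connects out to each device, or a pull model, in which each device periodically fetches updates from a central server.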

The push model is effective in data centres, where devices are connected via a high-performance, professionally administered network that ensures they’re reachable from the management server. However, it’s much less effective when devices have intermittent connectivity and are located behind Network Address Translators (NATs), firewalls, or other middleboxes that limit reachability, especially where there’s no system administrator to correct connectivity problems.

The pull model addresses some of these issues but, in turn, introduces scalability problems and a single point of failure. These can be mitigated with a Content Distribution Network (CDN) or other caching solution, but only at additional cost and complexity.

If we are to update devices for a reasonable cost, irrespective of the scale or heterogeneity of the system, we must develop more robust, scalable, and decentralized tools for cluster management.

Tools for cluster management: what’s required

In our paper, Peer to Peer Secure Update for Heterogeneous Edge Devices (published in the proceedings of the IEEE/IFIP International Workshop on Decentralized Orchestration and Management of Distributed Heterogeneous Things), we assert that the components of such tools need to include connectivity discovery and NAT traversal, overlays, and peer-to-peer updates.

Systems deployed in arbitrary edge networks must include robust connectivity discovery and NAT traversal, to ensure they can communicate with the outside world. Edge networks generally don’t accept arbitrary incoming connections, and often extensively filter outbound traffic. Management systems running in such environments must systematically probe for external connectivity and discover NAT bindings using multiple techniques, including the ICE algorithm for systematic STUN-based UDP hole punching, TURN relays or other indirect paths, or tunnelling over UDP, TCP, HTTPS, and WebSockets.
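As a rough illustration of the first step in that probing, the sketch below builds a STUN Binding Request and decodes the XOR-MAPPED-ADDRESS attribute of the reply, which is how a device learns the public address and port its NAT has assigned. It follows the wire format in RFC 5389 for IPv4 only; the function names and the server address passed to discover_nat_binding are assumptions made for this example, and it is not the code used in our prototype. A full ICE implementation does considerably more (candidate gathering, pairing, and connectivity checks).

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed constant defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None):
    """Return (packet, transaction_id) for a Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id, transaction_id


def parse_binding_response(packet, transaction_id):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(packet) < 20:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    if packet[8:20] != transaction_id:
        return None  # reply does not belong to our request
    offset, end = 20, min(len(packet), 20 + msg_len)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value = packet[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            port = xport ^ (MAGIC_COOKIE >> 16)
            addr = struct.pack("!I", xaddr ^ MAGIC_COOKIE)
            return socket.inet_ntoa(addr), port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def discover_nat_binding(server, timeout=2.0):
    """Ask a STUN server, given as a (host, port) tuple, for our public (ip, port)."""
    request, tid = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, server)
            data, _ = sock.recvfrom(2048)
        except OSError:
            return None  # filtered, timed out, or unreachable: try another technique
    return parse_binding_response(data, tid)
```

A None result is not an error here. It simply tells the management system that this path is blocked and that it should move on to the next technique, such as a TURN relay or a tunnel over HTTPS.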

These systems also need to probe local connectivity, using techniques such as multicast DNS service discovery, since there could be devices behind the same edge firewall that are only indirectly reachable from the wider network. Protocols such as Universal Plug and Play (UPnP) can also help discover topology and connectivity.
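To make the local probing step concrete, here is a minimal sketch of a one-shot multicast DNS query for the DNS-SD service enumeration name. The packet layout is standard DNS, and the multicast group and port come from RFC 6762; the function names are hypothetical, and the sketch only collects raw replies rather than parsing them. Because it sends from an ephemeral port, responders are expected to treat it as a legacy one-shot query and reply by unicast.

```python
import socket
import struct

MDNS_GROUP = ("224.0.0.251", 5353)  # IPv4 mDNS multicast address and port
TYPE_PTR, CLASS_IN = 12, 1


def encode_name(name):
    """Encode a dotted name as length-prefixed DNS labels."""
    out = b""
    for label in name.strip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) < 64:
            raise ValueError("invalid DNS label: %r" % label)
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_ptr_query(service="_services._dns-sd._udp.local"):
    """Build a DNS query with a single PTR question for `service`."""
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)  # id=0, flags=0, one question
    return header + encode_name(service) + struct.pack("!HH", TYPE_PTR, CLASS_IN)


def probe_local_services(timeout=1.0):
    """Multicast one query and collect raw replies as (sender_ip, packet) pairs."""
    replies = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_ptr_query(), MDNS_GROUP)
            while True:
                data, (addr, _port) = sock.recvfrom(9000)
                replies.append((addr, data))
        except OSError:
            pass  # timeout or no multicast route: return whatever arrived
    return replies
```

Each sender address in the result is a candidate neighbour behind the same firewall, which the management system can then try to contact directly.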

Discovering devices and pathways to connectivity is key — many tools have been developed in this space but have not been systematically used for DevOps. Once devices have been discovered and paths to connectivity found, an overlay can be built.

Building an overlay

The primary goal of building an overlay is connectivity, not performance. The intent is to reach all devices irrespective of how they’re connected, whether or not they’re directly reachable, and regardless of the presence of NATs or other middleboxes.
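One way to picture a connectivity-first overlay is as a graph search over whatever links the discovery step has found. The sketch below is an illustration only, under the assumption that an established link carries traffic in both directions; the function name and the device names in the usage note are hypothetical. It uses breadth-first search to find a relay path from a starting node to every device it can reach, without attempting to optimise for bandwidth or latency.

```python
from collections import deque


def relay_paths(links, root):
    """Return {device: [root, ..., device]} for every device reachable from root.

    `links` is an iterable of (a, b) pairs, each a working connection
    found during connectivity discovery.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    paths = {root: [root]}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        # sorted() only makes the result deterministic; any order would do
        for nxt in sorted(neighbours.get(current, ())):
            if nxt not in paths:
                paths[nxt] = paths[current] + [nxt]
                queue.append(nxt)
    return paths
```

For example, with links server to gateway, gateway to sensor-a, and sensor-a to sensor-b, the path to sensor-b runs through both intermediate devices, even though the server can never open a connection to sensor-b itself. Any device missing from the result has no working path yet and must wait until discovery finds one.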

The service should be similar to that of HashiCorp’s Serf: an open source gossip protocol pushing update notifications and simple configuration changes, without tracking membership or failures. Scaling such a service requires managing devices when they’re available, rather than tracking and updating them synchronously.
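The idea of notifying devices when they are available, rather than tracking them, can be sketched as a small simulation. This is not Serf's protocol or our implementation; the Node class, the fanout parameter, and the topology in the usage note are assumptions chosen to keep the example short. Each node knows only a few overlay neighbours and repeatedly pushes the newest version number it has heard of to a random subset of them.

```python
import random


class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []     # a few overlay neighbours, not a full member list
        self.version = 0    # newest update version this node has heard of
        self.online = True

    def receive(self, version):
        """Record a notification; return True if it was news to this node."""
        if not self.online or version <= self.version:
            return False
        self.version = version
        return True


def gossip_round(nodes, fanout, rng):
    """Every online node pushes its newest known version to `fanout` random peers."""
    for node in nodes:
        if not node.online or not node.peers:
            continue
        for peer in rng.sample(node.peers, min(fanout, len(node.peers))):
            peer.receive(node.version)


def make_ring(count):
    """Build `count` nodes where node i knows nodes i+1 and i+2 (mod count)."""
    nodes = [Node("node-%d" % i) for i in range(count)]
    for i, node in enumerate(nodes):
        node.peers = [nodes[(i + 1) % count], nodes[(i + 2) % count]]
    return nodes
```

A typical use is to set one node's version to 1 and call gossip_round(nodes, fanout=2, rng=random.Random()) repeatedly. No node keeps a record of who has received the notification: a device that was offline simply learns the new version from a neighbour in a later round, once it is reachable again.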

Finally, an existing peer-to-peer swarming protocol, such as BitTorrent, is needed to download and install larger updates. In our study, we augmented the torrent files with public key signatures, a UUID, and versioning information to ensure download authenticity. BitTorrent hashes the content to ensure integrity.
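The checks a device might run before installing a downloaded update can be sketched as follows. The UpdateManifest layout, the piece size, and the byte string that gets signed are all hypothetical and do not reproduce our actual torrent format. Signature verification is left as a caller-supplied function, because the Python standard library has no public key signature support; in practice it would wrap whichever signature scheme the publisher uses. The per-piece SHA-1 hashing mirrors how BitTorrent checks integrity.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Callable, Tuple

PIECE_LENGTH = 16 * 1024  # illustrative; real torrents choose their own piece size


@dataclass(frozen=True)
class UpdateManifest:
    """Hypothetical metadata carried alongside the torrent's piece hashes."""
    update_id: uuid.UUID
    version: int
    piece_hashes: Tuple[bytes, ...]  # SHA-1 digest of each piece
    signature: bytes

    def signed_bytes(self) -> bytes:
        """Byte string the publisher is assumed to have signed."""
        return (self.update_id.bytes
                + self.version.to_bytes(8, "big")
                + b"".join(self.piece_hashes))


def hash_pieces(data: bytes, piece_length: int = PIECE_LENGTH) -> Tuple[bytes, ...]:
    """Split data into fixed-size pieces and hash each one with SHA-1."""
    return tuple(hashlib.sha1(data[i:i + piece_length]).digest()
                 for i in range(0, len(data), piece_length))


def accept_update(manifest: UpdateManifest, data: bytes, installed_version: int,
                  verify_signature: Callable[[bytes, bytes], bool]) -> bool:
    """Accept only a newer, authentically signed, intact update."""
    if manifest.version <= installed_version:
        return False  # stale or replayed update
    if not verify_signature(manifest.signed_bytes(), manifest.signature):
        return False  # not published by a trusted key
    return hash_pieces(data) == manifest.piece_hashes
```

Signing the piece hashes rather than the content itself means a device can verify authenticity from the small metadata file alone, before it downloads anything from untrusted peers. A real client would also check each piece as it arrives rather than hashing the whole download at the end.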

Management of edge compute nodes has significantly different challenges than managing data centre nodes

The peer-to-peer management system that we developed according to the above principles is currently being integrated into the FRμIT testbed (an experimental federated edge compute testbed built using Raspberry Pi nodes), where an early prototype is available for download.

The key lesson we learned from this testbed is that management of edge compute nodes has significantly different challenges than managing data centre nodes, and that existing tools are insufficiently robust or scalable.

DevOps tools need to be able to manage devices that are not directly reachable, and to do so whenever those devices become available. And if management is to scale in a cost-effective manner, we must give up on precise control and tracking of devices.

Many challenges remain, but we believe that peer-to-peer updates are an essential part of future management tools, as the only way to get effective and scalable connectivity.

For further details, see our paper.

This is a reprint of an original post to the APNIC blog.

Opinions expressed are my own, and do not represent those of my employers or the organisations that fund my research.