Latency anomalies in remote procedure calls (RPCs) among microservices are common. In particular, latency anomalies due to packet-processing delays in the host network stack can inflate end-to-end request completion times by tens of milliseconds. Debugging the cause takes many person-hours and considerable expertise, because there are many components that could be at fault (e.g., NIC, root namespace, container namespace), and the delay events at these components are transient and sporadic (e.g., incast, CPU interference, traffic bursts). If not fixed quickly, delay events at a few components can compound, eventually leading to SLA violations on the completion time of many user requests. Application tracing tools like Jaeger provide good visibility into latency anomalies at the application layer, but they are not sufficient to localize delay events and the associated components in the host network stack.
To diagnose an RPC latency anomaly observed between a sender and a receiver, the first step is to localize the problem and then collect relevant traces for analysis. More specifically, localization involves determining whether the problem lies in the host network stack or at other layers; if it is in the network stack, the next step is to identify which component introduces the packet-processing delays. After localization, host-level traces collected using tools like perf, kprobes, and Intel PT can be analyzed to find the root cause.
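As a concrete illustration of what such host-level instrumentation can look like, the following is a minimal eBPF sketch in libbpf/CO-RE style; the hook points (ip_rcv(), tcp_v4_rcv()), the map sizing, and the 1 ms reporting threshold are illustrative assumptions rather than a prescribed design. It timestamps each packet as it enters the IP receive path and reports packets that take unusually long to reach TCP processing:

```c
/* latency_probe.bpf.c -- a minimal sketch (libbpf/CO-RE style) that
 * timestamps an skb entering ip_rcv() and reports how long it takes to
 * reach tcp_v4_rcv(). Hook points, map size, and the 1 ms threshold are
 * illustrative; a real tool would cover more protocols and layers.
 * vmlinux.h is generated with:
 *   bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);   /* skb pointer identifies the packet */
    __type(value, u64); /* timestamp at ip_rcv(), in ns */
} start SEC(".maps");

SEC("kprobe/ip_rcv")
int BPF_KPROBE(on_ip_rcv, struct sk_buff *skb)
{
    u64 key = (u64)skb;
    u64 ts = bpf_ktime_get_ns();

    bpf_map_update_elem(&start, &key, &ts, BPF_ANY);
    return 0;
}

SEC("kprobe/tcp_v4_rcv")
int BPF_KPROBE(on_tcp_v4_rcv, struct sk_buff *skb)
{
    u64 key = (u64)skb;
    u64 *tsp = bpf_map_lookup_elem(&start, &key);

    if (!tsp)
        return 0;

    u64 delta = bpf_ktime_get_ns() - *tsp;
    bpf_map_delete_elem(&start, &key);

    /* Surface only anomalous packets, e.g., > 1 ms inside the stack. */
    if (delta > 1000000)
        bpf_printk("skb %llx spent %llu ns between ip_rcv and tcp_v4_rcv",
                   key, delta);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```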
The recent trend of programmable network hardware (switches, SmartNICs) and the open network software ecosystem (e.g., ONF) built on top of it provide new opportunities to rethink fundamental questions in Internet security. Network programmability gives the flexibility to program high-speed hardware switches and implement novel network functions, system control, and higher-level services. Moreover, the same hardware can be reconfigured to meet changing requirements.
For instance, the same network device can be reconfigured to implement one or more network functions such as L2 forwarding, L3 routing, load balancing, NAT, border gateways, metering, firewalls, and in-network DDoS detection. We argue that this is the time to leverage the same programmability to implement security features for programmable devices, which can shape the next-generation Internet.
To date, systems using programmable hardware have been deployed in edge and cloud environments, and there are a few proof-of-concept implementations of core network functions. However, their primary focus is on the performance, availability, and security of the applications running in those environments, with little attention paid to important security challenges in the programmable network infrastructure itself.
Many cloud infrastructure organizations increasingly rely on third-party eBPF-based network functions (e.g., Cilium) for use cases such as security, observability, and load balancing, so that they do not each need a team of highly skilled eBPF experts. However, network functions from a third party (e.g., F5, Palo Alto) are delivered to cloud operators in bytecode format, giving little or no insight into possible bugs (functional correctness), packet-processing behavior, interactions with various kernels, and interactions with other network functions. More specifically, examples of the debugging and data-privacy questions to answer before deploying the bytecode include: Does it bypass other eBPF programs at the same hook point (e.g., XDP)? Does it make an unintended modification to a header field that another already-deployed eBPF program relies on? Does it copy sensitive information in violation of data-privacy policies?
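To make the first two questions concrete, the sketch below shows a hypothetical XDP function of the kind that might be shipped only as bytecode; both behaviors are invented for illustration and are exactly what an operator would want to detect before deployment:

```c
/* vendor_nf.bpf.c -- a hypothetical third-party XDP function, shown in
 * source form here but normally delivered only as bytecode. Both "side
 * effects" below are invented for illustration. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int vendor_nf(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    /* Question 2: silently clears the DSCP/ECN bits that an
     * already-deployed classifier may rely on (IP checksum update
     * omitted for brevity). */
    ip->tos = 0;

    /* Question 1: bounces ICMP packets straight back out of the NIC,
     * so they never reach later hooks (tc, netfilter) or other eBPF
     * programs. */
    if (ip->protocol == IPPROTO_ICMP)
        return XDP_TX;

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```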
Generally, eBPF programs are written in high-level languages (C, Rust, Python, etc.) and compiled into eBPF bytecode. Each high-level statement can translate into multiple bytecode instructions, making it challenging to interpret an instruction and understand its relation to the others in the execution flow. Additionally, because of the limited number of registers and the small stack, large and complex eBPF programs reuse registers and memory locations, leading to frequent redefinitions that further obscure their behavior.
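As a rough illustration, the trivial XDP program below compiles into a noticeably longer bytecode sequence; the disassembly shown in the trailing comment is approximate, since the exact instructions depend on the clang/LLVM version and optimization level:

```c
/* trivial_drop.bpf.c -- drops IPv4 packets; used only to illustrate how
 * a few C statements expand into many bytecode instructions. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int drop_ipv4(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* one source-level bounds check ... */
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto == bpf_htons(ETH_P_IP))
        return XDP_DROP;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

/* Running llvm-objdump -d on the compiled object shows something
 * roughly like the following (illustrative; the exact sequence varies
 * with the clang/LLVM version and -O level):
 *
 *   r2 = *(u32 *)(r1 + 4)     ; ctx->data_end
 *   r1 = *(u32 *)(r1 + 0)     ; ctx->data
 *   r3 = r1
 *   r3 += 14                  ; sizeof(struct ethhdr)
 *   if r3 > r2 goto <pass>    ; the single C-level bounds check
 *   r1 = *(u16 *)(r1 + 12)    ; eth->h_proto
 *   ...
 *
 * In larger programs the same registers (r1-r5) and stack slots are
 * reused for unrelated values, which is what makes relating bytecode
 * back to source-level logic difficult. */
```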
Service meshes like Istio support client-side load balancing, where the sidecar proxy (e.g., Envoy) attached to each upstream microservice distributes requests across the next-hop downstream service instances. However, contention for shared host, GPU, and network resources in the edge cloud leads to latency anomalies, which degrade the performance of latency-critical applications such as deep-learning-based video inference offered as a service to resource-constrained robots on an edge cloud orchestrated by Kubernetes-Istio. The load balancer should quickly adapt to dynamic delay events such as (1) high server load caused by the asymmetry between easy-to-generate requests and resource-intensive responses (GPU processing), and (2) sporadic, transient network congestion on the paths toward the downstream service instances due to the bandwidth-intensive nature of the requests (e.g., video frames for inference).
A load balancer that quickly adapts to such dynamic events, routing requests to the best available servers while keeping compute and network resources well utilized, improves response times, and especially tail latencies, for latency-critical applications.
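One generic building block for such a balancer (a standard technique, not a specific design from this work) is power-of-two-choices selection driven by a latency signal: each request samples two replicas at random and picks the one with the lower exponentially weighted moving average (EWMA) of recent response times, which tracks both server load and path congestion without global coordination. A minimal sketch, with invented replica and EWMA structures:

```c
/* p2c_ewma.c -- a minimal sketch of power-of-two-choices load balancing
 * over an EWMA latency signal. The structures and the ALPHA constant
 * are illustrative, not taken from any particular proxy. */
#include <stdio.h>
#include <stdlib.h>

#define ALPHA 0.2 /* weight of the newest latency sample */

struct replica {
    const char *name;
    double ewma_ms;      /* smoothed response time */
    int    outstanding;  /* in-flight requests, used to break ties */
};

/* Update the replica's latency estimate when a response arrives. */
static void record_latency(struct replica *r, double sample_ms)
{
    r->ewma_ms = ALPHA * sample_ms + (1.0 - ALPHA) * r->ewma_ms;
    r->outstanding--;
}

/* Pick two distinct replicas at random and send to the better one. */
static struct replica *pick(struct replica *pool, int n)
{
    int a = rand() % n;
    int b = rand() % (n - 1);
    if (b >= a)
        b++;                         /* ensure two distinct candidates */

    struct replica *ra = &pool[a], *rb = &pool[b];
    struct replica *best =
        (ra->ewma_ms < rb->ewma_ms) ||
        (ra->ewma_ms == rb->ewma_ms && ra->outstanding <= rb->outstanding)
            ? ra : rb;
    best->outstanding++;
    return best;
}

int main(void)
{
    struct replica pool[] = {
        { "replica-0", 12.0, 0 },
        { "replica-1", 35.0, 0 },    /* e.g., congested path or busy GPU */
        { "replica-2", 14.0, 0 },
    };
    for (int i = 0; i < 5; i++) {
        struct replica *r = pick(pool, 3);
        printf("request %d -> %s\n", i, r->name);
        record_latency(r, r->ewma_ms); /* stand-in for a measured sample */
    }
    return 0;
}
```

Envoy's least-request policy uses a similar two-random-choices comparison, though on outstanding request counts rather than measured latency; the key point is that the signal driving the choice must capture both compute and network delays.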
Network programmability has significantly increased the capabilities of both core networks and host networks. In the core network, one can specify the intended packet-processing behavior in a program written in a domain-specific language like P4 and deploy it onto the network devices. In the host network, using eBPF technology, one can add packet-processing capabilities to the Linux kernel by deploying eBPF programs developed in high-level languages like C, Python, and Go.
However, the ecosystem of programmable networks is becoming increasingly complex. Several components are involved in defining the packet-processing behavior of a target device (switch or host): the program that captures the intended behavior (P4/eBPF), the compiler that translates the high-level program into the target device's language (p4c/clang-LLVM), the control plane that derives match-action rules at runtime (ONOS/Cilium), and the software agent that configures the data plane at runtime (P4Runtime/eBPF maps).
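To illustrate the last of these components, the sketch below shows a user-space agent updating a data-plane eBPF map at runtime through the libbpf API; the pinned-map path and the key/value layout (IPv4 address mapped to an action) are assumptions for illustration:

```c
/* agent.c -- a minimal sketch of a control-plane agent installing a
 * "rule" into a data-plane eBPF map at runtime via libbpf. The pin path
 * and the key/value layout are illustrative assumptions. */
#include <stdio.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>

int main(void)
{
    /* Map previously pinned by the data-plane loader (assumed path). */
    int map_fd = bpf_obj_get("/sys/fs/bpf/xdp/blocklist");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    __u32 addr;          /* key: IPv4 address in network byte order */
    __u32 action = 1;    /* value: 1 = drop, interpreted by the
                            data-plane program */
    inet_pton(AF_INET, "192.0.2.10", &addr);

    /* Equivalent in spirit to installing a match-action rule. */
    if (bpf_map_update_elem(map_fd, &addr, &action, BPF_ANY) != 0) {
        perror("bpf_map_update_elem");
        return 1;
    }
    printf("rule installed: drop 192.0.2.10\n");
    return 0;
}
```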
Bugs in any of these components introduce packet-processing errors. Such bugs are difficult to detect (and mitigate) because they can manifest in any component, before or after the program is deployed, yet their presence can severely degrade overall network performance. Our objective is to design and develop systems that detect such hard-to-catch bugs as they manifest at runtime.
Data breaches and cyber-attacks involving Internet of Things (IoT) devices are becoming ever more concerning. Adversaries can exploit device vulnerabilities and launch network-based attacks that have serious negative implications for critical infrastructure. The heterogeneity of these devices and the sheer scale at which they are deployed make securing IoT devices highly challenging.
Existing security mechanisms either use off-path remote collectors to analyze IoT traffic or rely on specialized security middleboxes, which are functionally rigid and costly to scale. With the staggering growth of IoT devices, it is imperative to devise a scalable and holistic security strategy that addresses the security concerns of the IoT ecosystem.
The advent of programmable network devices and a language (P4) for specifying packet-processing behavior has enabled closed-loop in-network systems that operate largely in the data plane and thus benefit from line-rate speeds. However, such systems are built on top of memory-constrained programmable data plane (PDP) hardware, which limits their scalability. Additionally, they expose a larger attack surface at the data plane owing to the increased programmability.