Datacenter maintenance is a time-intensive burden involving repetitive and tedious operational tasks. It can be expensive and often diverts researchers away from core R&D work. Existing autonomous management tools typically rely on cloud-hosted LLMs, exposing sensitive infrastructure configurations and telemetry data to third-party services — an unacceptable tradeoff for secure or proprietary environments.
This project builds a local, agentic infrastructure management system powered by domain-specific small language models (SLMs). The system provides a natural language interface for configuring, debugging, and monitoring datacenter and network infrastructure, with all data remaining on-premises.
An event-driven agentic framework enables autonomous responses to infrastructure events, preserving institutional knowledge and reducing manual intervention for routine operational tasks. The architecture is designed to maintain reliability while operating entirely within secure environments.
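The event-driven pattern can be illustrated with a minimal dispatch loop. All class, handler, and event names below are hypothetical illustrations, not the project's actual interfaces:

```python
# Minimal sketch of an event-driven dispatch loop for infrastructure
# events. Handler and event names are hypothetical; in the real system
# a handler would consult a local SLM for a remediation plan.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Event:
    kind: str                           # e.g. "link_down", "disk_full"
    payload: dict = field(default_factory=dict)

class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Event], str]]] = {}

    def on(self, kind: str):
        """Register a handler for a given event kind."""
        def register(fn: Callable[[Event], str]):
            self._handlers.setdefault(kind, []).append(fn)
            return fn
        return register

    def dispatch(self, event: Event) -> list[str]:
        """Run all handlers for the event; return their proposed actions."""
        return [fn(event) for fn in self._handlers.get(event.kind, [])]

bus = EventBus()

@bus.on("link_down")
def restart_interface(ev: Event) -> str:
    # Canned remediation for illustration only.
    return f"ifup {ev.payload['iface']}"

actions = bus.dispatch(Event("link_down", {"iface": "eth0"}))
```

Routine events are thus mapped to codified remediation steps, which is how operational knowledge can be preserved in handlers rather than in individual operators' heads.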
Modern AI datacenters operate on high-speed RoCE networks at 100–400 Gbps, where congestion control directly impacts large-scale training performance. Commercial RNICs typically ship with fixed, vendor-specific congestion-control algorithms and provide limited telemetry visibility, restricting detailed analysis at microsecond granularity across diverse traffic conditions. This makes systematic profiling and realistic performance evaluation extremely challenging.
To address this, we design and build a programmable FPGA-based RNIC that enables runtime replacement of congestion-control algorithms without modifying the datapath.
The platform exposes easy-to-use and robust telemetry hooks to observe throughput dynamics, flow behavior, and congestion responses in real time at line rate. By combining flexibility with precise measurement infrastructure, the system delivers a practical test vehicle for profiling and comparing congestion-control strategies under realistic AI datacenter workloads.
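As a rough sketch of what consuming such telemetry looks like, the following derives per-interval throughput from periodic byte-counter snapshots; the sampling interface and field layout are hypothetical:

```python
# Sketch: deriving throughput from periodic byte-counter snapshots,
# as a telemetry hook on the RNIC might expose them. The snapshot
# format (timestamp_us, tx_bytes) is a hypothetical simplification.
def throughput_gbps(samples):
    """samples: time-ordered list of (timestamp_us, tx_bytes) tuples.
    Returns the throughput of each interval in Gbit/s."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        dt_s = (t1 - t0) * 1e-6          # microseconds -> seconds
        rates.append((b1 - b0) * 8 / dt_s / 1e9)
    return rates

# Two snapshots 10 us apart with 125,000 bytes sent -> 100 Gbit/s.
rates = throughput_gbps([(0, 0), (10, 125_000)])
```

At microsecond sampling granularity, this kind of differencing is what makes short-lived congestion responses visible rather than averaged away.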
This project explores the behavior and performance of hardware-offloaded congestion-control algorithms in high-speed data center networks designed for large-scale AI and HPC workloads. As modern transports such as UET and Falcon move congestion control into NIC hardware, understanding how different algorithms respond to diverse congestion signals becomes critical for building efficient and stable networks.
The core objective of the project is to systematically study how different congestion algorithms react to precisely controlled congestion signals, including ECN, RTT, CSIG, credits, INT, and packet trimming, under realistic, high-speed conditions.
The project aims to design a framework that enables emulation of congestion signals directly in flight, allowing algorithms such as EQDS, Swift, DCQCN, TIMELY, and other UET-compatible designs to be evaluated without relying on large-scale incast or oversubscribed testbeds.
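To make concrete what "reacting to a congestion signal" means, here is a simplified sketch of DCQCN's sender-side control law, as described in the original DCQCN proposal (Zhu et al., SIGCOMM 2015); the constants and recovery logic are deliberately reduced for illustration:

```python
# Simplified sketch of DCQCN's sender-side reaction to ECN feedback.
# Constants and the single-step "fast recovery" are illustrative
# reductions of the full algorithm, not a faithful implementation.
G = 1 / 256          # gain for the congestion-estimate update

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps       # current sending rate
        self.target = line_rate_gbps     # rate to recover toward
        self.alpha = 1.0                 # congestion estimate

    def on_cnp(self):
        """Congestion Notification Packet: multiplicative rate cut."""
        self.target = self.rate
        self.alpha = (1 - G) * self.alpha + G
        self.rate *= (1 - self.alpha / 2)

    def on_quiet_period(self):
        """No CNPs for a timer period: decay alpha, recover rate."""
        self.alpha *= (1 - G)
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(100.0)
s.on_cnp()           # ECN-triggered CNP halves the rate (alpha = 1)
```

A framework that injects ECN marks, RTT inflation, or trimmed packets on demand can drive exactly these state transitions deterministically, instead of waiting for incast to produce them.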
The rapid evolution of Large Language Models (LLMs) like GPT-4 and LLaMA has revolutionised artificial intelligence, but their deployment depends on massive, resource-intensive infrastructure. Training and serving these models requires distributed environments where thousands of GPUs communicate over high-performance networks. At this scale, hardware, network, and software faults are not just possible — they are inevitable.
In large-scale GPU clusters, faults range from hard failures such as GPU crashes and link disconnects to silent errors that are difficult to detect. Memory bit-flips can corrupt model parameters without triggering alerts, while network congestion and packet loss can stall gradient synchronisation. Software issues like CPU contention can also terminate jobs unexpectedly or degrade overall system throughput.
Our research develops proactive and lightweight anomaly detection solutions that not only detect faults and stragglers but also diagnose their root causes. The goal is to minimise downtime, improve reliability, and maintain performance in large-scale AI infrastructure used for both model training and serving.
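One lightweight detection primitive in this spirit is a robust outlier test over per-rank step times; the threshold and data layout below are illustrative, not the project's actual detector:

```python
# Sketch: flagging straggler ranks from per-iteration step times with
# a median-absolute-deviation (MAD) test. The threshold k and the
# input layout are illustrative choices.
import statistics

def find_stragglers(step_times, k: float = 3.0):
    """step_times: {rank: seconds taken for the last iteration}.
    Returns ranks whose step time exceeds median + k * MAD."""
    times = list(step_times.values())
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times) or 1e-9
    return [r for r, t in step_times.items() if t > med + k * mad]

# Rank 3 is ~40% slower than its peers and gets flagged.
slow = find_stragglers({0: 1.00, 1: 1.02, 2: 0.99, 3: 1.40})
```

The median/MAD pair is preferred over mean/stddev here because a single extreme straggler would otherwise inflate the threshold used to detect it.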
Latency anomalies in remote procedure calls (RPCs) among microservices are common. In particular, latency anomalies caused by packet-processing delays in the host network stack can inflate end-to-end request completion times by tens of milliseconds. Debugging the cause takes many engineer-hours and considerable expertise, as there are many components to blame (e.g., the NIC, the root namespace, a container namespace), and the delay events at these components are transient and sporadic (e.g., incast, CPU interference, traffic bursts). If not fixed quickly, delay events at a few components can compound, eventually leading to SLA violations on the completion times of many user requests. Application tracing tools like Jaeger provide good visibility into latency anomalies at the application layer, but they are not sufficient to localize delay events and the associated components in the host network stack.
To diagnose an RPC latency anomaly observed between a sender and a receiver, the first step is to localize the problem and then collect relevant traces for analysis. More specifically, localization involves determining whether the problem lies in the host network stack or at another layer; if it is in the network stack, the next step is to identify which component is introducing the packet-processing delays. After localization, host-level traces collected using tools like perf, kprobes, and Intel PT can be analyzed to find the root cause.
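The localization step reduces to attributing delay to stages along the receive path. A toy sketch, with hypothetical component names, timestamps, and budget:

```python
# Sketch: localizing where packet-processing delay accrues by
# differencing timestamps captured at successive host-stack components.
# Component names, timestamps, and the budget are hypothetical.
def localize_delay(ts_us, budget_us: float):
    """ts_us: ordered (component, timestamp_us) pairs along the path.
    Returns (component, stage_delay) pairs exceeding budget_us."""
    suspects = []
    for (c0, t0), (c1, t1) in zip(ts_us, ts_us[1:]):
        if t1 - t0 > budget_us:
            suspects.append((c1, t1 - t0))
    return suspects

path = [("NIC", 0.0), ("root_ns", 30.0),
        ("container_ns", 5030.0), ("socket", 5045.0)]
# The root_ns -> container_ns hop took ~5 ms, far above a 100 us budget.
suspects = localize_delay(path, budget_us=100.0)
```

Once a stage is implicated this way, deep but expensive tracing (perf, kprobes, Intel PT) can be focused on that component alone.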
The recent trend toward programmable network hardware (switches, SmartNICs) and the open network software ecosystem (e.g., ONF projects) built on top of it provides new opportunities to rethink fundamental questions in Internet security. Network programmability gives the flexibility to program high-speed hardware switches and implement novel network functions, system control, and higher-level services. Moreover, the same hardware can be reconfigured to meet changing requirements.
For instance, the same network device can be reconfigured to implement one or more network functions such as L2 forwarding, L3 routing, load balancing, NAT, border gateways, metering, firewalls, and in-network DDoS detection. We argue that now is the time to leverage these same programmability capabilities to implement security features for programmable devices that can shape the next-generation Internet.
To date, systems using programmable hardware are deployed in edge and cloud environments, and there are a few proof-of-concept implementations of core network functions. However, their primary focus is on the performance, availability, and security of the applications running in those environments, with little attention paid to important security challenges in the programmable network infrastructure itself.
Service meshes like Istio support client-side load balancing, where each upstream microservice's sidecar proxy (e.g., Envoy) distributes requests across the next-hop downstream microservices. However, contention for shared host, GPU, and network resources in the edge cloud leads to latency anomalies, which degrade the performance of latency-critical applications, such as deep-learning-based video inference offered as a service to resource-constrained robots on an edge cloud orchestrated by Kubernetes and Istio. The load balancer should quickly adapt to dynamic delay events such as (1) high server load caused by the asymmetry between easy-to-generate requests and resource-intensive responses (GPU processing), and (2) sporadic, transient network congestion on the paths toward downstream service instances caused by the bandwidth-intensive nature of the requests (e.g., video inference).
A load balancer that quickly adapts to dynamic events by routing requests to the best possible servers while ensuring that the compute and network resources are not underutilized improves the response times, especially tail latencies, for latency-critical applications.
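One common way to build such adaptivity, in the spirit of Envoy's least-request policy, is to combine EWMA-smoothed per-backend latency with power-of-two-choices selection; the class and parameters below are an illustrative sketch, not the project's design:

```python
# Sketch: latency-aware load balancing via EWMA-smoothed response
# times plus power-of-two-choices. The smoothing factor and backend
# names are illustrative.
import random

class LatencyAwareBalancer:
    def __init__(self, backends, beta: float = 0.2):
        self.ewma = {b: 0.0 for b in backends}   # smoothed latency (ms)
        self.beta = beta

    def record(self, backend: str, latency_ms: float):
        """Fold an observed response time into the backend's EWMA."""
        e = self.ewma[backend]
        self.ewma[backend] = (1 - self.beta) * e + self.beta * latency_ms

    def pick(self, rng=random) -> str:
        """Sample two distinct backends, route to the faster one."""
        a, b = rng.sample(list(self.ewma), 2)
        return a if self.ewma[a] <= self.ewma[b] else b

lb = LatencyAwareBalancer(["s1", "s2"])
lb.record("s1", 5.0)      # s1 responding quickly
lb.record("s2", 80.0)     # s2 congested or overloaded
```

The EWMA reacts within a few requests to the transient delay events described above, while two-choice sampling avoids herding every client onto the single currently-fastest server.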
Many cloud infrastructure organizations increasingly rely on third-party eBPF-based network functions (e.g., Cilium) for use cases like security, observability, and load balancing, so that not every organization needs a team of highly skilled eBPF experts. However, network functions from a third party (e.g., F5, Palo Alto Networks) are delivered to cloud operators in bytecode form, giving little or no insight into possible bugs (functional correctness), packet-processing behavior, interaction with different kernels, or interaction with other network functions. More specifically, examples of debugging and data-privacy questions to answer before deploying the bytecode include: Does it bypass other eBPF programs at the same hook point (e.g., XDP)? Does it make an unintended modification to a header field that another already-deployed eBPF program relies on? Does it copy sensitive information in ways that violate data-privacy policies?
Generally, eBPF programs are written in high-level languages (like C, Rust, or Python) and compiled into eBPF bytecode. A single high-level statement can translate into multiple bytecode instructions, making it challenging to interpret an instruction and understand its relation to other instructions in the execution flow. Additionally, due to the limited number of registers and limited stack memory, large and complex eBPF programs reuse registers and memory locations, leading to frequent redefinitions that further obscure their behavior.
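Untangling such register reuse amounts to recovering def-use chains from the instruction stream. A toy sketch over a deliberately simplified instruction representation (real eBPF bytecode encodes opcodes, offsets, and immediates that are omitted here):

```python
# Sketch: recovering def-use chains from (simplified) bytecode to
# expose register redefinitions. Each instruction is reduced to
# (dst_reg, [src_regs]); real eBPF encoding is richer than this.
def def_use_chains(insns):
    """Returns {instruction index: [(reg, defining index), ...]} for
    every register use, so each use points at the definition that is
    live at that point -- redefinitions shadow earlier ones."""
    last_def = {}              # reg -> index of its latest definition
    uses = {}
    for i, (dst, srcs) in enumerate(insns):
        for r in srcs:
            if r in last_def:
                uses.setdefault(i, []).append((r, last_def[r]))
        if dst is not None:
            last_def[dst] = i  # this definition shadows the old value
    return uses

# r1 is defined at insn 0, used at 1, REDEFINED at 2, used again at 3:
prog = [("r1", []), ("r2", ["r1"]), ("r1", ["r2"]), ("r0", ["r1"])]
chains = def_use_chains(prog)
```

With chains like these, a use of r1 at instruction 3 is correctly tied to the redefinition at instruction 2 rather than the original definition at 0, which is the basic bookkeeping any bytecode-level analysis of third-party programs needs.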
Network programmability has significantly increased the capabilities of both core networks and host networks. At the core network, one can specify the intended packet-processing behavior in a program written in a domain-specific language like P4 and deploy it onto the network devices. At the host network, using eBPF technology, one can extend the Linux kernel's packet-processing capabilities by deploying eBPF programs written in high-level languages like C, Python, and Go.
However, the ecosystem of programmable networks is becoming increasingly complex. Several components are involved in defining packet-processing behavior at the target device (switch or host): the program that captures the intended behavior (P4/eBPF), the compiler that translates high-level programs into the target device's language (p4c/clang-LLVM), the control plane that derives match-action rules at runtime (ONOS/Cilium), and the software agent that configures the data plane at runtime (P4Runtime/eBPF maps).
Bugs in any one or more of these components introduce packet-processing errors. Such bugs are difficult to detect (and mitigate), as they can manifest in any of the components either before or after the program is deployed, yet their presence can severely degrade overall network performance. Our objective is to design and develop systems that detect such hard-to-catch bugs as they manifest at runtime.
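One general way to surface such runtime bugs is differential checking: replay traffic through a reference model of the intended behavior and diff it against what the deployed data plane actually did. A toy sketch for longest-prefix-match forwarding, with illustrative rules and observations:

```python
# Sketch: detecting runtime packet-processing bugs by comparing a
# reference model of the intended behavior against the data plane's
# observed output. Rules and observed ports are illustrative.
def reference_forward(rules, dst_ip):
    """Longest-prefix match over (prefix, prefix_len, port) rules --
    the behavior the P4/eBPF program is *supposed* to implement."""
    best = None
    for prefix, plen, port in rules:
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        if dst_ip & mask == prefix and (best is None or plen > best[0]):
            best = (plen, port)
    return best[1] if best else None

def find_mismatches(rules, observed):
    """observed: {dst_ip: egress port actually taken}.
    Returns destinations where the data plane diverged from intent."""
    return [ip for ip, port in observed.items()
            if reference_forward(rules, ip) != port]

rules = [(0x0A000000, 8, 1),     # 10.0.0.0/8  -> port 1
         (0x0A010000, 16, 2)]    # 10.1.0.0/16 -> port 2
# The data plane sent 10.1.2.3 out port 1, violating longest-prefix match.
bad = find_mismatches(rules, {0x0A010203: 1, 0x0A020304: 1})
```

A divergence flagged this way does not by itself say whether the compiler, the control plane, or the runtime agent is at fault, but it pinpoints exactly which packets to investigate and when the misbehavior began.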