Datacenter maintenance is a time-intensive burden involving repetitive and tedious operational tasks. It can be expensive and often diverts researchers away from core R&D work. Existing autonomous management tools typically rely on cloud-hosted LLMs, exposing sensitive infrastructure configurations and telemetry data to third-party services — an unacceptable tradeoff for secure or proprietary environments.
This project builds a local, agentic infrastructure management system powered by domain-specific small language models (SLMs). The system provides a natural language interface for configuring, debugging, and monitoring datacenter and network infrastructure, with all data remaining on-premises.
An event-driven agentic framework enables autonomous responses to infrastructure events, preserving institutional knowledge and reducing manual intervention for routine operational tasks. The architecture is designed to maintain reliability while operating entirely within secure environments.
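The event-driven pattern can be illustrated with a minimal dispatch loop. All class, handler, and event names below are hypothetical illustrations, not the project's actual interfaces:

```python
# Minimal sketch of an event-driven dispatch loop for infrastructure
# events. Handler and event names are hypothetical; in the real system
# a handler would consult a local SLM for a remediation plan.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Event:
    kind: str                           # e.g. "link_down", "disk_full"
    payload: dict = field(default_factory=dict)

class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[Event], str]]] = {}

    def on(self, kind: str):
        """Register a handler for a given event kind."""
        def register(fn: Callable[[Event], str]):
            self._handlers.setdefault(kind, []).append(fn)
            return fn
        return register

    def dispatch(self, event: Event) -> list[str]:
        """Run all handlers for the event; return their proposed actions."""
        return [fn(event) for fn in self._handlers.get(event.kind, [])]

bus = EventBus()

@bus.on("link_down")
def restart_interface(ev: Event) -> str:
    # Canned remediation for illustration only.
    return f"ifup {ev.payload['iface']}"

actions = bus.dispatch(Event("link_down", {"iface": "eth0"}))
```

Routine events are thus mapped to codified remediation steps, which is how operational knowledge can be preserved in handlers rather than in individual operators' heads.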
Modern AI datacenters operate on high-speed RoCE networks at 100–400 Gbps, where congestion control directly impacts large-scale training performance. Commercial RNICs typically ship with fixed, vendor-specific congestion-control algorithms and provide limited telemetry visibility, restricting detailed analysis at microsecond granularity across diverse traffic conditions. This makes systematic profiling and realistic performance evaluation extremely challenging.
To address this, we design and build a programmable FPGA-based RNIC that enables runtime replacement of congestion-control algorithms without modifying the datapath.
The platform exposes easy-to-use and robust telemetry hooks to observe throughput dynamics, flow behavior, and congestion responses in real time at line rate. By combining flexibility with precise measurement infrastructure, the system delivers a practical test vehicle for profiling and comparing congestion-control strategies under realistic AI datacenter workloads.
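As a rough sketch of what consuming such telemetry looks like, the following derives per-interval throughput from periodic byte-counter snapshots; the sampling interface and field layout are hypothetical:

```python
# Sketch: deriving throughput from periodic byte-counter snapshots,
# as a telemetry hook on the RNIC might expose them. The snapshot
# format (timestamp_us, tx_bytes) is a hypothetical simplification.
def throughput_gbps(samples):
    """samples: time-ordered list of (timestamp_us, tx_bytes) tuples.
    Returns the throughput of each interval in Gbit/s."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        dt_s = (t1 - t0) * 1e-6          # microseconds -> seconds
        rates.append((b1 - b0) * 8 / dt_s / 1e9)
    return rates

# Two snapshots 10 us apart with 125,000 bytes sent -> 100 Gbit/s.
rates = throughput_gbps([(0, 0), (10, 125_000)])
```

At microsecond sampling granularity, this kind of differencing is what makes short-lived congestion responses visible rather than averaged away.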
This project explores the behavior and performance of hardware-offloaded congestion-control algorithms in high-speed data center networks designed for large-scale AI and HPC workloads. As modern transports such as UET and Falcon move congestion control into NIC hardware, understanding how different algorithms respond to diverse congestion signals becomes critical for building efficient and stable networks.
The core objective of the project is to systematically study how different congestion algorithms react to precisely controlled congestion signals, including ECN, RTT, CSIG, credits, INT, and packet trimming, under realistic, high-speed conditions.
The project aims to design a framework that enables emulation of congestion signals directly in flight, allowing algorithms such as EQDS, Swift, DCQCN, TIMELY, and other UET-compatible designs to be evaluated without relying on large-scale incast or oversubscribed testbeds.
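To make concrete what "reacting to a congestion signal" means, here is a simplified sketch of DCQCN's sender-side control law, as described in the original DCQCN proposal (Zhu et al., SIGCOMM 2015); the constants and recovery logic are deliberately reduced for illustration:

```python
# Simplified sketch of DCQCN's sender-side reaction to ECN feedback.
# Constants and the single-step "fast recovery" are illustrative
# reductions of the full algorithm, not a faithful implementation.
G = 1 / 256          # gain for the congestion-estimate update

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps       # current sending rate
        self.target = line_rate_gbps     # rate to recover toward
        self.alpha = 1.0                 # congestion estimate

    def on_cnp(self):
        """Congestion Notification Packet: multiplicative rate cut."""
        self.target = self.rate
        self.alpha = (1 - G) * self.alpha + G
        self.rate *= (1 - self.alpha / 2)

    def on_quiet_period(self):
        """No CNPs for a timer period: decay alpha, recover rate."""
        self.alpha *= (1 - G)
        self.rate = (self.rate + self.target) / 2

s = DcqcnSender(100.0)
s.on_cnp()           # ECN-triggered CNP halves the rate (alpha = 1)
```

A framework that injects ECN marks, RTT inflation, or trimmed packets on demand can drive exactly these state transitions deterministically, instead of waiting for incast to produce them.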
The rapid evolution of Large Language Models (LLMs) like GPT-4 and LLaMA has revolutionised artificial intelligence, but their deployment depends on massive, resource-intensive infrastructure. Training and serving these models requires distributed environments where thousands of GPUs communicate over high-performance networks. At this scale, hardware, network, and software faults are not just possible — they are inevitable.
In large-scale GPU clusters, faults range from hard failures such as GPU crashes and link disconnects to silent errors that are difficult to detect. Memory bit-flips can corrupt model parameters without triggering alerts, while network congestion and packet loss can stall gradient synchronisation. Software issues like CPU contention can also terminate jobs unexpectedly or degrade overall system throughput.
Our research develops proactive and lightweight anomaly detection solutions that not only detect faults and stragglers but also diagnose their root causes. The goal is to minimise downtime, improve reliability, and maintain performance in large-scale AI infrastructure used for both model training and serving.
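One lightweight detection primitive in this spirit is a robust outlier test over per-rank step times; the threshold and data layout below are illustrative, not the project's actual detector:

```python
# Sketch: flagging straggler ranks from per-iteration step times with
# a median-absolute-deviation (MAD) test. The threshold k and the
# input layout are illustrative choices.
import statistics

def find_stragglers(step_times, k: float = 3.0):
    """step_times: {rank: seconds taken for the last iteration}.
    Returns ranks whose step time exceeds median + k * MAD."""
    times = list(step_times.values())
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times) or 1e-9
    return [r for r, t in step_times.items() if t > med + k * mad]

# Rank 3 is ~40% slower than its peers and gets flagged.
slow = find_stragglers({0: 1.00, 1: 1.02, 2: 0.99, 3: 1.40})
```

The median/MAD pair is preferred over mean/stddev here because a single extreme straggler would otherwise inflate the threshold used to detect it.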
Latency anomalies in remote procedure calls (RPCs) among microservices are common. In particular, latency anomalies caused by packet-processing delays in the host network stack can inflate end-to-end request completion times by tens of milliseconds. Debugging the cause takes many engineer-hours and considerable expertise, as there are many components to blame (e.g., the NIC, the root namespace, a container namespace), and the delay events at these components are transient and sporadic (e.g., incast, CPU interference, traffic bursts). If not fixed quickly, delay events at a few components can compound, eventually leading to SLA violations on the completion times of many user requests. Application tracing tools like Jaeger provide good visibility into latency anomalies at the application layer, but they are not sufficient to localize delay events and the associated components in the host network stack.
To diagnose an RPC latency anomaly observed between a sender and a receiver, the first step is to localize the problem and then collect relevant traces for analysis. More specifically, localization involves determining whether the problem lies in the host network stack or at another layer; if it is in the network stack, the next step is to identify which component is introducing the packet-processing delays. After localization, host-level traces collected using tools like perf, kprobes, and Intel PT can be analyzed to find the root cause.
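The localization step reduces to attributing delay to stages along the receive path. A toy sketch, with hypothetical component names, timestamps, and budget:

```python
# Sketch: localizing where packet-processing delay accrues by
# differencing timestamps captured at successive host-stack components.
# Component names, timestamps, and the budget are hypothetical.
def localize_delay(ts_us, budget_us: float):
    """ts_us: ordered (component, timestamp_us) pairs along the path.
    Returns (component, stage_delay) pairs exceeding budget_us."""
    suspects = []
    for (c0, t0), (c1, t1) in zip(ts_us, ts_us[1:]):
        if t1 - t0 > budget_us:
            suspects.append((c1, t1 - t0))
    return suspects

path = [("NIC", 0.0), ("root_ns", 30.0),
        ("container_ns", 5030.0), ("socket", 5045.0)]
# The root_ns -> container_ns hop took ~5 ms, far above a 100 us budget.
suspects = localize_delay(path, budget_us=100.0)
```

Once a stage is implicated this way, deep but expensive tracing (perf, kprobes, Intel PT) can be focused on that component alone.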
The recent trend toward programmable network hardware (switches, SmartNICs) and the open network software ecosystem (e.g., ONF projects) built on top of it provides new opportunities to rethink fundamental questions in Internet security. Network programmability gives the flexibility to program high-speed hardware switches and implement novel network functions, system control, and higher-level services. Moreover, the same hardware can be reconfigured to meet changing requirements.
For instance, the same network device can be reconfigured to implement one or more network functions such as L2 forwarding, L3 routing, load balancing, NAT, border gateways, metering, firewalls, and in-network DDoS detection. We argue that now is the time to leverage these same programmability capabilities to implement security features for programmable devices that can shape the next-generation Internet.
To date, systems using programmable hardware are deployed in edge and cloud environments, and there are a few proof-of-concept implementations of core network functions. However, their primary focus is on the performance, availability, and security of the applications running in those environments, with little attention paid to important security challenges in the programmable network infrastructure itself.
Service meshes like Istio support client-side load balancing, where each upstream microservice's sidecar proxy (e.g., Envoy) distributes requests across the next-hop downstream microservices. However, contention for shared host, GPU, and network resources in the edge cloud leads to latency anomalies, which degrade the performance of latency-critical applications, such as deep-learning-based video inference offered as a service to resource-constrained robots on an edge cloud orchestrated by Kubernetes and Istio. The load balancer should quickly adapt to dynamic delay events such as (1) high server load caused by the asymmetry between easy-to-generate requests and resource-intensive responses (GPU processing), and (2) sporadic, transient network congestion on the paths toward downstream service instances caused by the bandwidth-intensive nature of the requests (e.g., video inference).
A load balancer that quickly adapts to dynamic events by routing requests to the best possible servers while ensuring that the compute and network resources are not underutilized improves the response times, especially tail latencies, for latency-critical applications.
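One common way to build such adaptivity, in the spirit of Envoy's least-request policy, is to combine EWMA-smoothed per-backend latency with power-of-two-choices selection; the class and parameters below are an illustrative sketch, not the project's design:

```python
# Sketch: latency-aware load balancing via EWMA-smoothed response
# times plus power-of-two-choices. The smoothing factor and backend
# names are illustrative.
import random

class LatencyAwareBalancer:
    def __init__(self, backends, beta: float = 0.2):
        self.ewma = {b: 0.0 for b in backends}   # smoothed latency (ms)
        self.beta = beta

    def record(self, backend: str, latency_ms: float):
        """Fold an observed response time into the backend's EWMA."""
        e = self.ewma[backend]
        self.ewma[backend] = (1 - self.beta) * e + self.beta * latency_ms

    def pick(self, rng=random) -> str:
        """Sample two distinct backends, route to the faster one."""
        a, b = rng.sample(list(self.ewma), 2)
        return a if self.ewma[a] <= self.ewma[b] else b

lb = LatencyAwareBalancer(["s1", "s2"])
lb.record("s1", 5.0)      # s1 responding quickly
lb.record("s2", 80.0)     # s2 congested or overloaded
```

The EWMA reacts within a few requests to the transient delay events described above, while two-choice sampling avoids herding every client onto the single currently-fastest server.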
Many cloud infrastructure organizations increasingly rely on third-party eBPF-based network functions (e.g., Cilium) for use cases like security, observability, and load balancing, so that not every organization needs a team of highly skilled eBPF experts. However, network functions from a third party (e.g., F5, Palo Alto Networks) are delivered to cloud operators in bytecode form, giving little or no insight into possible bugs (functional correctness), packet-processing behavior, interaction with different kernels, or interaction with other network functions. More specifically, examples of debugging and data-privacy questions to answer before deploying the bytecode include: Does it bypass other eBPF programs at the same hook point (e.g., XDP)? Does it make an unintended modification to a header field that another already-deployed eBPF program relies on? Does it copy sensitive information in ways that violate data-privacy policies?
Generally, eBPF programs are written in high-level languages (like C, Rust, or Python) and compiled into eBPF bytecode. A single high-level statement can translate into multiple bytecode instructions, making it challenging to interpret an instruction and understand its relation to other instructions in the execution flow. Additionally, due to the limited number of registers and limited stack memory, large and complex eBPF programs reuse registers and memory locations, leading to frequent redefinitions that further obscure their behavior.
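Untangling such register reuse amounts to recovering def-use chains from the instruction stream. A toy sketch over a deliberately simplified instruction representation (real eBPF bytecode encodes opcodes, offsets, and immediates that are omitted here):

```python
# Sketch: recovering def-use chains from (simplified) bytecode to
# expose register redefinitions. Each instruction is reduced to
# (dst_reg, [src_regs]); real eBPF encoding is richer than this.
def def_use_chains(insns):
    """Returns {instruction index: [(reg, defining index), ...]} for
    every register use, so each use points at the definition that is
    live at that point -- redefinitions shadow earlier ones."""
    last_def = {}              # reg -> index of its latest definition
    uses = {}
    for i, (dst, srcs) in enumerate(insns):
        for r in srcs:
            if r in last_def:
                uses.setdefault(i, []).append((r, last_def[r]))
        if dst is not None:
            last_def[dst] = i  # this definition shadows the old value
    return uses

# r1 is defined at insn 0, used at 1, REDEFINED at 2, used again at 3:
prog = [("r1", []), ("r2", ["r1"]), ("r1", ["r2"]), ("r0", ["r1"])]
chains = def_use_chains(prog)
```

With chains like these, a use of r1 at instruction 3 is correctly tied to the redefinition at instruction 2 rather than the original definition at 0, which is the basic bookkeeping any bytecode-level analysis of third-party programs needs.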
Network programmability has significantly increased the capabilities of both core networks and host networks. At the core network, one can specify the intended packet-processing behavior in a program written in a domain-specific language like P4 and deploy it onto the network devices. At the host network, using eBPF technology, one can extend the Linux kernel's packet-processing capabilities by deploying eBPF programs written in high-level languages like C, Python, and Go.
However, the ecosystem of programmable networks is becoming increasingly complex. Several components are involved in defining packet-processing behavior at the target device (switch or host): the program that captures the intended behavior (P4/eBPF), the compiler that translates high-level programs into the target device's language (p4c/clang-LLVM), the control plane that derives match-action rules at runtime (ONOS/Cilium), and the software agent that configures the data plane at runtime (P4Runtime/eBPF maps).
Bugs in any one or more of these components introduce packet-processing errors. Such bugs are difficult to detect (and mitigate), as they can manifest in any of the components either before or after the program is deployed, yet their presence can severely degrade overall network performance. Our objective is to design and develop systems that detect such hard-to-catch bugs as they manifest at runtime.
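One general way to surface such runtime bugs is differential checking: replay traffic through a reference model of the intended behavior and diff it against what the deployed data plane actually did. A toy sketch for longest-prefix-match forwarding, with illustrative rules and observations:

```python
# Sketch: detecting runtime packet-processing bugs by comparing a
# reference model of the intended behavior against the data plane's
# observed output. Rules and observed ports are illustrative.
def reference_forward(rules, dst_ip):
    """Longest-prefix match over (prefix, prefix_len, port) rules --
    the behavior the P4/eBPF program is *supposed* to implement."""
    best = None
    for prefix, plen, port in rules:
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        if dst_ip & mask == prefix and (best is None or plen > best[0]):
            best = (plen, port)
    return best[1] if best else None

def find_mismatches(rules, observed):
    """observed: {dst_ip: egress port actually taken}.
    Returns destinations where the data plane diverged from intent."""
    return [ip for ip, port in observed.items()
            if reference_forward(rules, ip) != port]

rules = [(0x0A000000, 8, 1),     # 10.0.0.0/8  -> port 1
         (0x0A010000, 16, 2)]    # 10.1.0.0/16 -> port 2
# The data plane sent 10.1.2.3 out port 1, violating longest-prefix match.
bad = find_mismatches(rules, {0x0A010203: 1, 0x0A020304: 1})
```

A divergence flagged this way does not by itself say whether the compiler, the control plane, or the runtime agent is at fault, but it pinpoints exactly which packets to investigate and when the misbehavior began.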