Hi, I’m Saurav Shrivastav
Software Engineer | Distributed Systems & Infrastructure Orchestration
I engineer the software that powers large-scale infrastructure. Currently, I am a Software Engineer at LinkedIn in the Reliability Infra organization, where I build distributed orchestration engines to manage a fleet of 80,000+ nodes. My work focuses on transforming complex operational challenges into scalable, “Infra-as-Code” solutions using Python, Go, and Temporal.
Read more about me on the About page, or check out my latest posts on the blog.
What I’m Building
I work on Hadoop/YARN/HDFS infrastructure at scale, with a growing emphasis on agentic automation:
- Distributed Orchestration: Designed and implemented state-machine-driven remediation workflows using Temporal to manage host lifecycles across multi-datacenter deployments.
- Infrastructure-as-Code: Automated cluster expansion and host provisioning systems for a massive Hadoop fleet, recovering significant underutilized hardware capacity.
- Resource Management: Built heuristic-based allocation engines to proactively manage build pool hosts, reducing idle time by 88%.
- Platform Modernization: Leading core service upgrades to modern API frameworks to improve data correctness and system resilience.
The Current Sprint: AI & Systems
I believe the next frontier of infrastructure is Autonomous Reliability. I am currently documenting my journey in building:
- Agentic Workflows: Orchestrating LLMs via LangGraph to reason about and fix complex system faults.
- Reliable AI: Integrating agentic loops with Temporal to ensure AI-driven remediations are consistent, durable, and safe for production.
- System Internals: Deep-diving into consensus protocols (Raft/Paxos) and high-performance networking with gRPC.
What You’ll Find Here
- Blog Posts: Technical deep-dives on distributed systems, race conditions, and building internal developer platforms.
- Papershelf: Analysis of foundational research papers, from storage engines like LSM-trees to the latest in AI orchestration.
- Learning Journey: A public log of my builds, from gRPC log-intelligence services to self-healing grid agents.
Recent Notes & Articles
A comprehensive look at Large Language Models - from the Transformer architecture to the mechanics of inference and fine-tuning.
Demystifying the relationships between Artificial Intelligence, Machine Learning, and Deep Learning.
Bridging the gap between research and production: A practical look at AI Engineering.