Building a 400 Million Events per Day Serverless Observability System
Part 1: The Beginning
I remember walking through the doors on the first day of my internship feeling so excited to finally start working at a company I had always dreamed of. I sat at reception waiting for my manager to come get me and introduce me to my team, already convinced that I would be handed my project on day one. I met everyone, got settled in and mentally prepared myself to start building straight away.
To my shock, I spent the next two weeks working through long and, honestly, boring onboarding materials. Every three days, I would go to my manager and ask for a sneak peek of my project. Each time he would smile, laugh a little and sympathise about how slow onboarding can feel, but he kept telling me to stick it out because I would only get my project once I was at least seventy percent through it.
Two weeks later, the day finally came. My manager set up a meeting with him, my mentor and me to walk me through my project. I went into the room so excited to finally meet “my baby”. My mentor shared the project document with me, and the moment I opened it I felt shivers down my spine. I was an intern looking at a system with so many moving parts, dependencies and background context. I won’t lie, I was scared, but I was also excited about how much I was going to learn.
I worked with a team that owned a large-scale internal system used to process customer account data. Let's call it Orion. Over time, Orion had grown into a very complex system with many interconnected steps, and engineers had to dig through multiple logs and tools to understand what happened when a customer account passed through it.
My job was to build a tool that could capture and visualise an account’s full journey through Orion. The aim was to reduce the long manual investigation process and replace it with a dashboard that showed the engineers what happened, where it happened and how long each process took.
I left the meeting scared and honestly didn't ask a single question. I told them I needed time to digest what I'd just read and that I would set up another meeting in the coming days. After reading tons of documentation, interviewing teammates, asking endless questions and trying to understand the system from multiple angles, things slowly began to click.
Three days later, I met with my mentor again, this time with a long list of questions. For most of them, he hit me with the classic “you should be able to deal with ambiguity”. I left the meeting feeling frustrated because I felt like I had more gaps than answers. What I didn’t realise at the time was that he was pushing me to dive deeper and take full ownership of the project. And funny enough, when I did go deeper, I ended up answering a lot of my own questions.
System Design Phase
The next two weeks were dedicated entirely to system design. By this point, I had figured out some core principles the tool needed to follow.
While studying Orion’s structure, I identified a fixed series of operations every account goes through. Mapping this end-to-end journey gave me the foundation for designing an observability system around it. Using this structure, I estimated the scale of events the tool would need to ingest and process daily. The total volume was significant enough that the system needed to be built for high throughput, elasticity and reliability. Those estimates guided all my capacity planning decisions.
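To give a concrete sense of that arithmetic, here is a minimal back-of-envelope sketch in Python. It assumes the roughly 400 million events per day from the title, an illustrative 3x peak-to-average ratio, and Kinesis's standard per-shard write limit for the streaming layer I describe later; these are stand-in numbers for the kind of estimate I made, not the exact figures from my capacity plan.

```python
# Back-of-envelope capacity estimate. The 400M/day figure comes from the
# article title; the peak factor and the per-shard limit are illustrative
# assumptions, not numbers taken from the original design doc.
EVENTS_PER_DAY = 400_000_000
SECONDS_PER_DAY = 24 * 60 * 60

avg_tps = EVENTS_PER_DAY / SECONDS_PER_DAY  # ~4,630 events/sec on average
peak_tps = avg_tps * 3                      # assume a 3x peak-to-average ratio

# A Kinesis shard accepts up to 1,000 records/sec (or 1 MB/sec) on ingest,
# so here the shard count is driven by the peak record rate.
RECORDS_PER_SHARD_PER_SEC = 1_000
shards_needed = -(-peak_tps // RECORDS_PER_SHARD_PER_SEC)  # ceiling division

print(f"average: {avg_tps:,.0f} events/sec")
print(f"assumed peak: {peak_tps:,.0f} events/sec")
print(f"shards at assumed peak: {shards_needed:.0f}")
```

Even this rough version makes the core point: at hundreds of millions of events a day, nothing in the pipeline can be hand-scaled, which is what pushed me toward fully managed, elastic building blocks.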
The tool also had to integrate with Orion without introducing delays, which meant using a lightweight, non-blocking, fire-and-forget data collection pattern so the core system never slowed down.
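In practice, "fire and forget" looks roughly like the sketch below: a minimal Python example assuming boto3 and a hypothetical stream name, where delivery errors are logged-and-swallowed so the instrumented code path never fails. A production version would likely also batch records or emit from a background thread rather than calling Kinesis inline.

```python
import json
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical stream name, for illustration only.
STREAM_NAME = "orion-observability-events"
kinesis = boto3.client("kinesis")

def emit_event(account_id: str, execution_id: str, event: dict) -> None:
    """Fire-and-forget emission: failures are swallowed so the
    instrumented path in Orion never blocks or raises."""
    try:
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps({"accountId": account_id,
                             "executionId": execution_id,
                             **event}).encode("utf-8"),
            # Partition by account + execution so one run's events stay
            # ordered within a single shard.
            PartitionKey=f"{account_id}#{execution_id}",
        )
    except (BotoCoreError, ClientError):
        # Deliberately ignore delivery errors: observability must never
        # take down the system it observes.
        pass
```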
I also needed storage that could handle large amounts of structured event data while staying cost-efficient. Engineers would rely on this tool to investigate issues quickly, so the database had to support fast lookups and scale automatically whenever traffic spiked.
System Design & Architecture
Before I wrote a single line of code, I had to produce design documents outlining my proposed architecture. These went through multiple rounds of reviews with my team and stakeholders. What I thought would take two weeks stretched into three, but the feedback made the design clearer, stronger and more defensible.
UI/UX Design
The first thing I designed was the dashboard interface. I needed a way to visualise an account’s full processing journey in a way that felt intuitive but still gave engineers all the details they needed.
My UI was built around a timeline view that showed each major operation, its status, duration, dependencies and any errors. I added colour-coded indicators so failures were instantly recognisable. I also incorporated a way to show when operations ran in parallel, since Orion performs multiple asynchronous steps during processing.
The goal, for me, was simple: let an engineer type in an accountID and executionID and immediately see a clean, structured picture of what happened during that run without scrolling endlessly through logs.
Event Pipeline Design
For the event pipeline, I used Kinesis to collect all events emitted by Orion. The stream captured everything and delivered the events in batches to a set of Lambda functions I wrote.
Inside the Lambda layer, I formatted each event before storing it in DynamoDB. Each record used:
a primary key: accountId#executionId
a sort key: timestamp#eventType
a state field showing whether the step started or ended
a status field indicating whether it ultimately succeeded or failed
Each event was written as its own record, which made it easy to reconstruct the entire timeline quickly.
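Putting that together, a minimal sketch of the ingestion Lambda might look like the following. The table name and the pk/sk attribute names are my illustrative stand-ins for the key scheme described above, not the actual code.

```python
import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name, for illustration only.
table = dynamodb.Table("orion-account-events")

def handler(event, context):
    """Kinesis-triggered Lambda: decode each record in the batch and
    write it to DynamoDB using the key scheme described above."""
    with table.batch_writer() as batch:
        for record in event["Records"]:
            # Kinesis delivers record payloads base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            batch.put_item(Item={
                # Partition key groups every event for one run together.
                "pk": f'{payload["accountId"]}#{payload["executionId"]}',
                # Sort key keeps events in time order and splits by type.
                "sk": f'{payload["timestamp"]}#{payload["eventType"]}',
                "state": payload["state"],    # e.g. STARTED / ENDED
                "status": payload["status"],  # e.g. SUCCEEDED / FAILED
            })
```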
On the query side, another Lambda function sat behind API Gateway. When an engineer searched for an account, the Lambda queried DynamoDB, retrieved all related events, ordered them, shaped them into a timeline and returned them to the frontend for visualisation.
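A simplified version of that query Lambda, again with hypothetical names and assuming the key scheme above, could look like this. It leans on the fact that a DynamoDB query returns items already ordered by the sort key, so the timestamp#eventType key gives the timeline ordering almost for free.

```python
import json
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orion-account-events")  # hypothetical name

def handler(event, context):
    """API Gateway-backed Lambda: fetch every event for one run and
    shape it into a timeline for the frontend."""
    params = event.get("queryStringParameters") or {}
    pk = f'{params["accountId"]}#{params["executionId"]}'

    # Items come back sorted by the sort key (timestamp#eventType).
    resp = table.query(KeyConditionExpression=Key("pk").eq(pk))

    timeline = [
        {
            "timestamp": item["sk"].split("#", 1)[0],
            "eventType": item["sk"].split("#", 1)[1],
            "state": item["state"],
            "status": item["status"],
        }
        for item in resp["Items"]
    ]
    return {"statusCode": 200, "body": json.dumps(timeline)}
```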
What’s Next
In Part 2, I'll go deeper into my technology choices, break down the architecture further, and talk through the key technical decisions I made, the constraints I hit during implementation and the alternatives I evaluated.
