How We Built an Enterprise-Grade Auto Annotation Tool

15 January, 2026 | 6 Min | By Dr. Srivatsan Sridhar

What if you could slash the time required to create a fully annotated, step-by-step tip sheet from hours to mere seconds? For EHR training and informatics teams in complex fields like Health IT, this is a game-changer. At 314e, we built an auto annotation tool that turns this into a reality for our AI-powered Just-in-Time training platform, Jeeves.

Jeeves already had the powerful capability to convert screen-recorded videos into multi-step "tip sheets" in minutes. But one crucial bottleneck remained: the lack of true automatic annotation. Highlighting key UI elements was a manual, time-consuming process that prevented end-to-end automation.

This post dives deep into the architecture we engineered to solve this: a resilient, scalable ML pipeline that powers near-instant automatic image annotation, transforming the content creation workflow from fast to fully automated.

Here's what we'll cover:

The Challenge - Architecting a Resilient Automatic Data Annotation Pipeline
A Four-Act Architecture
Our Efforts in Action
Transforming Creators Into Reviewers

The Challenge - Architecting a Resilient Automatic Data Annotation Pipeline

An effective auto-annotation process is far more than a single AI command. It’s a complex, multi-stage journey where the failure of one part could jeopardize the entire operation. We had to build a system that could reliably execute a sequence of demanding tasks, from video analysis to image recognition and text generation.

Our primary goals for the system were:

Resilience: AI models and processing tasks can occasionally be slow or run into errors. The system had to be smart enough to handle these hiccups gracefully, automatically retrying a step if it fails without bringing the entire process to a halt.
Awareness: The journey from a raw video to a fully annotated guide is a long-running process. The system needed to keep track of its progress at all times, knowing which frames were extracted, which summaries were generated, and what still needed to be done.
Scalability: Jeeves is used by many clients at once. The system had to be capable of handling a sudden influx of requests, processing hundreds of concurrent jobs without slowing down.
Clarity: For our support teams and developers, we needed clear visibility into every running job to quickly diagnose any issues.

Simpler approaches often prove too fragile for a task this complex. A failure in one step can break the entire chain, and managing the state of hundreds of simultaneous jobs becomes a major challenge. This led us to invest in an enterprise-grade workflow engine as the backbone of our system. This powerful coordinator handles all the complexity of retries, state management, and scaling, allowing our developers to focus on what matters most: the AI logic that makes automatic annotation possible.

A Four-Act Architecture

We can think of the generation process as a four-act play, managed by our sophisticated workflow engine.

Act I - Capturing the User's Journey

The process begins the moment a user uploads their screen recording. Our system's first responsibility is to establish a perfect, reliable record of the user's actions. The UI provides this directly in the form of a detailed map, linking each mouse click's timestamp to its video frame and precise coordinates.

The initial workflow takes this raw interaction data and formalizes it. The output is a structured JSON object that serves as the "single source of truth" for what the user did and when. This step ensures that every subsequent AI process is grounded in the reality of the user's actual journey.

Act II - Writing the Story and Adding the Visuals

With a clean set of key moments from the video, the system begins building the guide's narrative.

The process starts by turning the video’s spoken audio (captions) into a clean, readable set of instructions. An advanced language model reads through everything said and transforms it into a structured guide, essentially a list of steps, each with a short title and a description. At this point, the entire tutorial exists in text form only; no images are attached yet.

Once the text is ready, the system enriches it with visuals in a multi-stage process:

Matching steps to moments in the video: To figure out which part of the video corresponds to each written step, the system intelligently compares the wording of the steps with the wording of the original captions. By measuring how similar they are, it can pinpoint the exact moment each step happens in the video.
Pulling out the perfect frames: Using this precise timing information, it grabs frames from the video at the most relevant moments. Raw video, however, often contains long stretches where almost nothing changes. To avoid a guide full of repetitive screenshots, the system automatically compares the images and filters out any duplicates, keeping only the distinct and meaningful ones.
Saving the visuals: Finally, the selected frames are uploaded and linked to their corresponding steps in the guide. The tip sheet now has a complete script and the perfect images to illustrate each action. Here's a glimpse from the actual process:

Jeeves' AI Generating Tip Sheet Steps From a Video_314e

With the text and images in place, the system hands off each image to the next act for the next, magical step: automatic annotation.

Act III - The Vision Engine: Seeing and Understanding the Screen

This is where our system's "eyes" get to work. For each image, a dedicated process begins a comprehensive visual analysis. This is the core of our auto annotation tool.

The process is a powerful two-part sequence:

Finding every element: Our vision AI meticulously scans the image. One specialized model reads every piece of text on the screen and notes its location. Simultaneously, another specialized model identifies all the interactive elements like buttons, icons, and menus. The system then intelligently merges these two sets of findings, removing overlaps to create a single, clean map of every element on the screen.
Describing what it sees: Once all the elements are identified, another AI model enhances the descriptions. For elements that are purely visual (like a 'save' icon with no text), this model looks at the icon and generates a human-like description, such as "A blue save button." This ensures every identified element has a rich, understandable label.

This two-step process: first finding, then describing, makes the entire system more accurate and robust, ensuring high-quality results every time.

Act IV - "Click-Aware" Logic: The Smart Step in Auto Annotation

This final act is the magic the user experiences. When they click the "Auto Annotate" button, our system provides intelligent suggestions. Our initial attempts at this were simple; we just asked an AI to look at the step's description and guess what to highlight. The results were generic and often missed the mark.

We quickly realized that for the best annotations, our AI needed more than just instructions; it needed the user's context.

Our team engineered a "click-aware" process that provides our AI with three key pieces of information to make a brilliant decision:

The step's goal: The title and description of the current guide step.
All possible elements: The full list of buttons, fields, and icons our vision engine found on the screen.
The user's actions: The specific UI element the user actually clicked on, derived from the precise click map created in Act I.

By providing the AI with what the user was trying to do and what they actually did, its recommendations for what to highlight become remarkably accurate and relevant. It’s no longer guessing; it’s making an informed decision based on real user behavior.

Our Efforts in Action

Transforming Creators Into Reviewers

With this final piece in place, the entire process from video upload to a fully generated, intelligently annotated guide is complete in just a few seconds. This has fundamentally shifted the workflow for our users. A training manager's job is no longer to be a content creator, painstakingly drawing boxes on screenshots. Instead, they are now reviewers, giving a final approval to AI-generated content and using their expertise to refine, not to build from scratch.

Key Architectural Takeaways

Building this auto annotation tool taught us several invaluable lessons about architecting complex, AI-powered features:

Orchestrate, Don't Choreograph:

For any multi-stage, long-running process involving fallible services, a dedicated orchestrator like Temporal is a necessity, not a luxury. It abstracts away the immense complexity of state management, retries, and scalability, allowing engineering teams to focus on delivering business value.

Chain Specialized Models for Compounded Value:

Our most powerful results came from creating a specialist assembly line of models. Instead of relying on one giant model to solve everything, we used a pipeline of smaller, SOTA (State-of-the-Art) models. This approach allows each model to do what it does best, with a final LLM acting as an intelligent synthesizer to connect their outputs into a cohesive, valuable result.

Context Is the Catalyst for Intelligence:

Our initial recommendations were intelligent, but they lacked pinpoint accuracy. The true breakthrough came when we shifted our focus from perfecting the prompt's wording to enriching the data we fed into it. The missing ingredient was the user's own behavior. By incorporating the specific click data for each step, we gave the LLM the crucial context it needed to move from making educated guesses to performing genuine reasoning. This single addition was the key that elevated its output from generic to genuinely insightful, proving that the most valuable data often comes directly from the user's actions.

We are incredibly proud of the engineering behind this feature. It stands as a testament to how thoughtful architecture and a deep understanding of the user's workflow can turn a complex problem into a seamless, almost magical experience.

CONTRIBUTORS BEHIND THE BUILD