Building An AI Assistant at the Edge

July 5, 2024 • 9 min read

How difficult is it to build a voice assistant with current off-the-shelf AI models? In this article, we will explain how we built a simple AI assistant using Vue 3, Nuxt, and Llama 3 8B on top of Cloudflare services.

The complete project source code is available on GitHub: marmelab/cloudflare-ai-assistant.

The Target

We recently discovered Aqua, a voice-native AI assistant and text editor. We found the demo quite impressive:

YouTube might track you and we would rather have your consent before loading this video.

Always allow

Then we asked ourselves: How hard would it be to build a clone of this agent using off-the-shelf solutions? Our proof-of-concept, built in a day, is still far from the original, but it’s encouraging:

It’s a good starting point, but more interestingly, this proof-of-concept helped us understand some of the challenges of building an AI assistant.

The Scope

We worked on this AI-assisted text editor during one of Marmelab’s Hack Days. This means we were time-constrained and had to focus on the core features. We divided the application into the following user stories:

As Abigail, I want to draft a shopping list using my keyboard
As Abigail, I want to improve a shopping list using my keyboard
As Abigail, I want to improve a shopping list using my voice

We chose to focus on the first two user stories and built a simple text-to-text generation assistant. We also explored voice recognition during our hack days, and we know we can plug in any solution we come up with. We will cover voice recognition in a future article.

The Stack

We knew we needed an LLM to process user input and generate the assistant’s response. We chose to use a hosted service to save time. We also wanted to run the model at the edge to reduce latency. We chose Cloudflare Workers AI for this purpose.

Zoom In Cloudflare Workers AI

Cloudflare is a cloud platform that initially provided Reverse Proxy and Content Delivery services. With the release of Cloudflare Workers in 2017, the company offers serverless computing services at the edge. In late 2023, they expanded their product range by launching a serverless GPU inference service: Workers AI. This new service allows developers to run machine learning models at the edge on top of the Cloudflare Global Network.

As the assistant’s text editing task doesn’t require deep knowledge or sophisticated reasoning, we opted for a relatively small LLM, the Llama 3 8B open-source model.

For the frontend, we chose to develop it using Vue.js 3 and Nuxt. We used the Nuxt Cloudflare module to proxy the API calls during development.

Generating Text Using The Cloudflare AI SDK

Nuxt relies on the h3 web server (an alternative to Express) to define API routes. H3 can work on various runtimes (Node.js, Deno, Bun, and Cloudflare). Nuxt adds h3’s functions (defineEventHandler, setResponseHeaders, readValidatedBody, etc) to the global scope. We added input validation using zod. The resulting API endpoint that does text-to-text inference is available below.

import { Ai } from '@cloudflare/ai';
import { z } from 'zod';

const ROLE_ASSISTANT = "assistant";
const ROLE_USER = "user";
const ROLE_SYSTEM = "system";
const SYSTEM_PROMPT = `
You are an AI assistant that help users to take notes or write emails.
You keep you notes and emails as simple as possible.`;

const messageSchema = z.object({
  role: z.enum([ROLE_USER, ROLE_ASSISTANT]),
  content: z.string(),
});

const bodySchema = z.object({
  messages: z
    .array(messageSchema)
    .min(1);
});

export default defineEventHandler(async event => {
    const cloudflareBindings = event.context?.cloudflare?.env;
    if (!cloudflareBindings) {
        throw new Error('No Cloudflare bindings found.');
    }

    const body = await readValidatedBody(event, bodySchema.safeParse);
    if (!body.success) {
        // We return an error here
        return;
    }

    const ai = new Ai(cloudflareBindings.AI);

    const stream = await ai.run('@cf/meta/llama-3-8b-instruct', {
        messages: [
            { role: ROLE_SYSTEM, content: SYSTEM_PROMPT },
            ...body.data.messages,
        ],
        stream: true,
    });

    setResponseHeaders(event, {
        'content-type': 'text/event-stream',
        'cache-control': 'no-cache',
    });
    return stream;
});

The @cloudflare/ai package provides a simple JS API to run inferences on the Cloudflare Workers AI service, abstracting the underlying model and the API calls.

Cloudflare Workers AI supports both streaming and non-streaming modes. However, we faced timeouts on the Cloudflare side during development with non-streaming mode. When this occurred, we would only receive a partial response with cropped text. This is why we implemented streaming mode for this endpoint.

The streaming mode returns a text/event-stream response. This format is interesting since it is plug-and-play with the EventSource API.

Yet, since the prompt endpoint we implemented is HTTP POST-based, we cannot use EventSource in this case as it only supports GET requests. The next part will focus on sending prompts and receiving generated text on the frontend using POST requests.

Deploying the API endpoints to Cloudflare Workers is straightforward because Nuxt developers provide an out-of-the-box Cloudflare Workers integration.

Calling the API Endpoint from the Frontend

Vue 3 introduces composables to reuse logic across an application. If you know React, Vue’s composables are like custom hooks. We decided to write the text generation logic in a composable as it is easier to maintain over time.

Our text generation composable returns the loading state, the generated text (received as a stream), and a function to perform the prompt. Furthermore, as mentioned earlier, this composable is also in charge of parsing the text/event-stream received from the prompt endpoint, as EventSource does not support POST-based requests.

The composable stores the message history (for both user prompts and assistant responses) as well, and sends all the history back to the prompt API when the user interacts with it. We found that keeping the context across requests gives better results.

import { ref } from "vue";

export type Prompt = {
  role: "user" | "assistant";
  content: string;
};

export default function usePrompt() {
  // public API
  const messages = ref<Prompt[]>([]);
  const loading = ref<boolean>(false);

  async function sendPrompt(prompt: string) {
    const trimedPrompt = prompt.trim();
    if (!trimedPrompt || loading.value) {
      // TODO: handle error
      return;
    }

    messages.value = [
      ...messages.value,
      { role: "user", content: trimedPrompt },
    ];

    loading.value = true;

    const currentMessages = [...messages.value];
    messages.value = [...messages.value, { role: "assistant", content: "" }];

    await fetch("/api/prompt", {
      method: "POST",
      headers: [["Content-Type", "application/json"]],
      body: JSON.stringify({ messages: messages.value }),
    })
      .then(async (response) => {
        if (response.status !== 200 || !response.body) {
          // TODO: handle error
          return;
        }

        const reader = response.body.getReader();
        const decoder = new TextDecoder("utf-8");

        if (!reader) {
          return;
        }

        let responseBody = "";
        while (true) {
          const { done, value } = await reader.read();

          if (done) {
            break;
          }

          responseBody += decoder.decode(value);

          const content = responseBody
            .split("data:")
            .map((part) => {
              const json = part.trim();
              if (!json || json === "[DONE]") {
                return null;
              }

              try {
                const { response } = JSON.parse(json) as {
                  response: string;
                };
                return response;
              } catch {
                return null;
              }
            })
            .filter((part) => part !== null)
            .join("");

          messages.value = [...currentMessages, { role: "assistant", content }];
        }
      })
      .finally(function () {
        loading.value = false;
      });
  }

  return {
    loading,
    messages,
    sendPrompt,
  };
}

To parse the event stream, we chose to split the received text data using the data: token as the divider. While this is not compliant with the full event stream specification, it was sufficient for our use case. Initially, we faced a bug where some tokens were missing as we parsed the response body in real-time. This was due to the fact that a chunk might contain multiple data: parts. To fix this bug, we stored the entire response body in memory and parsed it each time we received a new chunk.

Integrating Text Generation in the User Interface

After setting up the text generation composable, we used it in our user interface. An example of how to use the composable is available below.

<style>
  pre {
    white-space: pre-wrap; /* Since CSS 2.1 */
    white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
    white-space: -pre-wrap; /* Opera 4-6 */
    white-space: -o-pre-wrap; /* Opera 7 */
    word-wrap: break-word; /* Internet Explorer 5.5+ */
  }
</style>

<script setup lang="ts">
  import { ref, watch } from "vue";
  import usePrompt from "../composables/usePrompt";

  const prompt = ref<string>("");
  const submitButton = ref<HTMLButtonElement | null>(null);

  const { loading, sendPrompt, messages } = usePrompt();

  function submit() {
    submitButton.value?.click?.();
  }

  async function onSubmit(event: Event) {
    event.preventDefault();

    const promptFormValue = prompt.value.trim();
    if (loading.value || !promptFormValue) {
      return;
    }

    await sendPrompt(promptFormValue);

    prompt.value = "";
  }

  let lastAssistantMessage = ref<Prompt>(null);

  watch(messages, (newMessages) => {
    lastAssistantMessage.value = newMessages.findLast(
      (message) => message.role === "assistant",
    );
  });
</script>

<template data-theme="cupcake">
  <pre v-if="lastAssistantMessage?.content">
{{ lastAssistantMessage?.content }}</pre
  >

  <form @submit="onSubmit">
    <textarea
      placeholder="Type your message..."
      v-model="prompt"
      @keydown.meta.enter="submit"
      @keydown.ctrl.enter="submit"
      :disabled="loading"
    />

    <button type="submit" ref="submitButton" :disabled="loading">
      <span class="loading loading-spinner" v-if="loading" />
      <SendIcon v-else />
    </button>
  </form>
</template>

An interesting feature of Vue is how we can easily bind composed keyboard events using keydown.meta.enter and keydown.ctrl.enter attributes on inputs. With React, this would have required filtering the key combination within the event handler to provide the same functionality.

Results, Limitations, and Future Directions

We tested the assistant with various prompts and found that the Llama 3 8B model is quite good at distinguishing between text to retain and editing instructions, even with a very simple system prompt.

We will need to conduct a more thorough evaluation to determine the model’s performance on a wider range of text types. We also need to test the model’s ability to generate text in different languages.

Using the keyboard is definitely a limiting factor in the user experience. While it allows easy testing of the assistant, it turns out to be more error-prone and time-consuming than directly editing the result. An AI-assisted text editor is only useful if you don’t have access to a keyboard, like when you’re driving or cooking. This is why we will focus on voice recognition in a future article.

Latency is also a concern. Our assistant has to wait for the model to generate the entire response before displaying it. Even with a small model like Llama 3 8B, this means waiting long time intervals after each prompt, especially when the edited text counts hundreds of words. The advantage provided by running the LLM at the edge isn’t really significant here: the token generation speed is the bottleneck.

Finally, the current user interface is too brittle. We systematically replace the previous text with the new text after each LLM call. The user has to spot the differences or, more often, read the entire text again to check the changes. To make it a real tool, we would need to implement a diffing algorithm to highlight the changes between the previous and new text, just like Aqua does.

Diffing in Aqua

Conclusion

Overall, we are quite impressed with the performance of the Llama 3 model as an AI assistant. Yet, the model had some strange responses during my tests where some of the tokens were lost and reappeared after some other prompts. Besides that, the model exhibited very few hallucinations during my tests and understood prompts well in most cases.

Furthermore, we recommend Cloudflare Workers AI as it provides a great API to run LLM inferences on a wide range of models. Moreover, their documentation is well-written and offers comprehensive examples. Finally, the integration with the Nuxt framework is great and provides a really good development experience.

A future article will focus on the integration of the Whisper model for speech recognition to implement a vocal AI assistant instead of a text-based one.

Authors

Jonathan Arnault

Full-stack web developer at marmelab, Jonathan likes to cook and do photography on his spare time.

Anthony Rimet

Full-stack web developer at marmelab, Anthony seeks to improve and learn every day. He likes basketball, motorsports and is a big Harry Potter fan

Ready to build something extraordinary?

Our team of talented full-stack developers is ready to tackle your next web or mobile project. Let's build it together!