
[Part 1] Building a Low-Cost RAG Personal Knowledge Base with Cloudflare

1. Introduction

The generous Cloudflare strikes again! Not only do they offer free AI model invocations, but they also teach us how to build our own RAG (Retrieval-Augmented Generation).

What is RAG?

RAG is a technical architecture that combines information retrieval with large language model generation, enabling AI to:

  • First retrieve relevant information from an external knowledge base
  • Then generate answers based on the retrieved content
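In rough pseudocode, the flow looks like this (retrieve and generate here are hypothetical placeholders, not real APIs):

// 1. Retrieve: find documents relevant to the question
const context = await retrieve(knowledgeBase, question);
// 2. Generate: answer the question with the retrieved documents as context
const answer = await generate({ context, question });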

And all of this can be achieved through Cloudflare with just a Workers Paid plan ($5/month) 😹, which also includes a bunch of other features not detailed here.

The following content is based on this Cloudflare blog post. If you're comfortable with English, I recommend reading the original.

Build a Retrieval Augmented Generation (RAG) AI

1.1 Want to See the Code First?

All the code can be found on GitHub: demo-rag-ai-tutorial

1.2 Live Demo

There's also a live demo page where you can try it out: https://demo-rag.runnable.run/ui

2. Building a Retrieval Augmented Generation (RAG) AI · Cloudflare Workers AI Documentation

This guide will walk you through setting up and deploying your first application using Cloudflare AI. You'll build a fully functional AI application using Workers AI, Vectorize, D1, and Cloudflare Workers.

Want a Managed Solution?

AI Search provides a fully managed way to build RAG pipelines on Cloudflare, covering data ingestion, indexing, and querying out of the box.

After completing this tutorial, you'll have an AI tool that can store information and query it using a large language model. This pattern is called Retrieval Augmented Generation (RAG), and you can implement it by combining multiple capabilities from the Cloudflare AI toolkit. No prior experience with AI tools is required to build this application.

  1. Sign up for a Cloudflare account.
  2. Install Node.js.
Node.js Version Manager

Use a Node version manager like Volta or nvm to avoid permission issues and switch Node.js versions. Wrangler, which you'll use later, requires Node version 16.17.0 or higher.
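For example, with nvm you can install and switch to a current LTS release:

nvm install --lts
nvm use --lts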

You'll also need access to Vectorize. This tutorial additionally demonstrates an optional integration with Anthropic Claude; if you want to use it, you'll need an Anthropic API key.

2.1. Create a New Worker Project

C3 (create-cloudflare-cli) is a command-line tool designed to help you quickly create and deploy Workers on Cloudflare.

Open your terminal and run C3 to create your Worker project:

npm create cloudflare@latest -- rag-ai-tutorial

In the interactive setup, select the following options:

  • For "What would you like to start with?", choose Hello World example
  • For "Which template would you like to use?", choose Worker only
  • For "Which language do you want to use?", choose JavaScript
  • For "Do you want to use git for version control?", choose Yes
  • For "Do you want to deploy your application?", choose No (we'll make some changes before deploying)

In your project directory, C3 will generate several files.

What files did C3 create?

  1. wrangler.jsonc: Your Wrangler configuration file
  2. worker.js (in /src): A minimal "Hello World!" Worker written using ES Module syntax
  3. package.json: Minimal Node dependencies configuration file
  4. package-lock.json: See npm documentation on package-lock.json
  5. node_modules: See npm documentation on node_modules

Then navigate to your newly created project directory:

cd rag-ai-tutorial

2.2. Develop with the Wrangler CLI

The Workers command-line tool Wrangler allows you to create, develop locally, and deploy Workers projects. C3 installs Wrangler by default for your project.

After creating your first Worker, run wrangler dev in your project directory to start a local development server for local testing during development.

npx wrangler dev

Now visit http://localhost:8787 to see your Worker running. Any changes to your code will trigger a rebuild, and refreshing the page will show the latest output.

2.3. Add AI Binding

To use Cloudflare's AI products, you can add an ai block as a remote binding in your Wrangler configuration file. This sets up a binding to Cloudflare AI models in your code, allowing you to interact with available models on the platform.

Tip

If you haven't used Wrangler before, it will try to open a browser for you to log in with your Cloudflare account.

If you encounter issues at this step or cannot use the browser interface, refer to the wrangler login documentation.

This example uses the @cf/meta/llama-3-8b-instruct model, which is used for text generation.

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "ai": {
    "binding": "AI",
    "remote": true
  }
}

Next, find the src/index.js file. Inside the fetch handler, you can make model calls through the AI binding:

export default {
  async fetch(request, env, ctx) {
    const answer = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      messages: [{ role: "user", content: `What is the square root of 9?` }],
    });

    return new Response(JSON.stringify(answer));
  },
};

By calling the LLM through the AI binding, we can interact directly with Cloudflare AI's large language models in our code.

You can deploy your Worker using wrangler:

npx wrangler deploy

Making a request to your Worker will return the LLM-generated answer as a JSON object:

curl https://example.username.workers.dev
{"response":"Answer: The square root of 9 is 3."}

2.4. Add Vector Embeddings with Cloudflare D1 and Vectorize

Embeddings allow you to add retrieval capabilities to language models in your Cloudflare AI projects. This is achieved through Vectorize, Cloudflare's vector database.

To get started with Vectorize, first create an embedding index using wrangler. This index stores 768-dimensional vectors and uses cosine similarity to determine similarity between vectors:

npx wrangler vectorize create vector-index --dimensions=768 --metric=cosine

Then add the configuration for your new Vectorize index to your Wrangler configuration file:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "vectorize": [
    {
      "binding": "VECTOR_INDEX",
      "index_name": "vector-index"
    }
  ]
}

A vector index is used to store a set of dimensions (floating-point numbers) that represent your data. When you need to query, you also convert your query into a vector. Vectorize is designed to efficiently find stored vectors that are most similar to your query.

To implement search functionality, you need to set up a Cloudflare D1 database. You can store application data in D1, convert it to vector format, and when a user searches and matches a vector, retrieve and display the corresponding data.

Create a new D1 database using wrangler:

npx wrangler d1 create database

Then paste the configuration output from the previous command into your Wrangler configuration file:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "d1_databases": [
    {
      "binding": "DB",
      "database_name": "database",
      "database_id": "abc-def-geh"
    }
  ]
}

In this application, we'll create a notes table in D1 to store the note text; after a vector match in Vectorize, the corresponding row is looked up here. Run the following SQL to create the table:

npx wrangler d1 execute database --remote --command "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"

Now use wrangler d1 execute to add a note to the database:

npx wrangler d1 execute database --remote --command "INSERT INTO notes (text) VALUES ('The best pizza topping is pepperoni')"

2.5. Create a Workflow

Before we start creating notes, let's introduce Cloudflare Workflow. It allows us to define a durable workflow to reliably execute all steps in the RAG process.

First, add a new workflows configuration block to your Wrangler configuration file:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "workflows": [
    {
      "name": "rag",
      "binding": "RAG_WORKFLOW",
      "class_name": "RAGWorkflow"
    }
  ]
}

Add a class named RAGWorkflow in src/index.js that extends WorkflowEntrypoint:

import { WorkflowEntrypoint } from "cloudflare:workers";

export class RAGWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    await step.do("example step", async () => {
      console.log("Hello World!");
    });
  }
}

This class defines a simple workflow step that outputs "Hello World!" to the console. You can add more steps as needed.

The workflow itself doesn't execute automatically. To trigger it, you need to call the RAG_WORKFLOW binding and pass in the parameters required by the workflow. Here's an example call:

env.RAG_WORKFLOW.create({ params: { text } });

2.6. Create Notes and Write to Vectorize

To allow Workers to handle multiple routes, we'll introduce the Workers routing library hono to create a route for adding notes. Install hono via npm:

npm i hono

Then import hono in src/index.js and use it to refactor the fetch handler:

import { Hono } from "hono";
const app = new Hono();

app.get("/", async (c) => {
  const answer = await c.env.AI.run("@cf/meta/llama-3-8b-instruct", {
    messages: [{ role: "user", content: `What is the square root of 9?` }],
  });

  return c.json(answer);
});

export default app;

This creates a route at the root path / with functionality equivalent to the previous version of the application.

Next, update the workflow to write notes to the database and generate corresponding embedding vectors.

This example uses the @cf/baai/bge-base-en-v1.5 model for generating embeddings. Embeddings are stored in Vectorize and can be retrieved. User queries are also converted to embeddings for searching in Vectorize.

import { WorkflowEntrypoint } from "cloudflare:workers";

export class RAGWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    const env = this.env;
    const { text } = event.payload;

    const record = await step.do(`create database record`, async () => {
      const query = "INSERT INTO notes (text) VALUES (?) RETURNING *";

      const { results } = await env.DB.prepare(query).bind(text).run();

      const record = results[0];
      if (!record) throw new Error("Failed to create note");
      return record;
    });

    const embedding = await step.do(`generate embedding`, async () => {
      const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
        text: text,
      });
      const values = embeddings.data[0];
      if (!values) throw new Error("Failed to generate vector embedding");
      return values;
    });

    await step.do(`insert vector`, async () => {
      return env.VECTOR_INDEX.upsert([
        {
          id: record.id.toString(),
          values: embedding,
        },
      ]);
    });
  }
}

This workflow performs the following steps:

  1. Receives the text parameter
  2. Inserts a new record into the D1 notes table and gets the new row's id
  3. Converts text to a vector using the LLM's embedding model
  4. Upserts the id and vector to Vectorize's vector-index index

This creates a vector representation for the note that can be used for retrieval later.

To complete the functionality, we add a route that allows users to submit notes to the database. This route parses the JSON request body, gets the text parameter, and creates a workflow instance with that parameter:

app.post("/notes", async (c) => {
  const { text } = await c.req.json();
  if (!text) return c.text("Missing text", 400);
  await c.env.RAG_WORKFLOW.create({ params: { text } });
  return c.text("Created note", 201);
});

2.7. Query Vectorize to Retrieve Notes

To complete the code, you can perform a Vectorize query at the root path (/). First convert the question to a vector, then use the vector-index index to find the most similar vectors.

The topK parameter limits the number of vectors returned. For example, with topK = 1, only the "most similar" vector based on the query is returned; setting it to 5 returns the 5 most similar vectors.

Based on the list of similar vectors, you can retrieve notes corresponding to the record IDs stored with those vectors. Here we only retrieve one note; you can customize as needed.
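If you want to retrieve several notes instead of one, a minimal sketch (using the same query call as the code below) is to raise topK and collect every matched ID:

const vectorQuery = await c.env.VECTOR_INDEX.query(vectors, { topK: 5 });
// Matches are ordered by similarity; each match carries the id stored during upsert
const vecIds = vectorQuery.matches.map((match) => match.id);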

You can insert the text of these notes as context into the LLM's prompt. This is the foundation of Retrieval Augmented Generation (RAG): providing data outside the LLM as additional context to enhance its generation quality.

We'll update the prompt to include context and ask the LLM to use this context when answering:

import { Hono } from "hono";
const app = new Hono();

// Existing post route...
// app.post('/notes', async (c) => { ... })

app.get("/", async (c) => {
  const question = c.req.query("text") || "What is the square root of 9?";

  const embeddings = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
    text: question,
  });
  const vectors = embeddings.data[0];

  const vectorQuery = await c.env.VECTOR_INDEX.query(vectors, { topK: 1 });
  let vecId;
  if (
    vectorQuery.matches &&
    vectorQuery.matches.length > 0 &&
    vectorQuery.matches[0]
  ) {
    vecId = vectorQuery.matches[0].id;
  } else {
    console.log("No matching vector found or vectorQuery.matches is empty");
  }

  let notes = [];
  if (vecId) {
    const query = `SELECT * FROM notes WHERE id = ?`;
    const { results } = await c.env.DB.prepare(query).bind(vecId).run();
    if (results) notes = results.map((vec) => vec.text);
  }

  const contextMessage = notes.length
    ? `Context:\n${notes.map((note) => `- ${note}`).join("\n")}`
    : "";

  const systemPrompt = `When answering the question or responding, use the context provided, if it is provided and relevant.`;

  const { response: answer } = await c.env.AI.run(
    "@cf/meta/llama-3-8b-instruct",
    {
      messages: [
        ...(notes.length ? [{ role: "system", content: contextMessage }] : []),
        { role: "system", content: systemPrompt },
        { role: "user", content: question },
      ],
    },
  );

  return c.text(answer);
});

app.onError((err, c) => {
  // err is an Error object; convert it to a string before returning it as text
  return c.text(String(err));
});

export default app;

2.8. Integrate Anthropic Claude Model (Optional)

If you need to handle larger documents, you can optionally use Anthropic's Claude model, which has a larger context window and is well-suited for RAG workflows.

First install the @anthropic-ai/sdk package:

npm i @anthropic-ai/sdk

In src/index.js, you can update the GET / route to check for the ANTHROPIC_API_KEY environment variable: if it exists, use the Anthropic SDK to generate text; otherwise fall back to the current Workers AI code:

import Anthropic from "@anthropic-ai/sdk";

app.get("/", async (c) => {
  // ... Existing code
  const systemPrompt = `When answering the question or responding, use the context provided, if it is provided and relevant.`;

  let modelUsed = "";
  let response = null;

  if (c.env.ANTHROPIC_API_KEY) {
    const anthropic = new Anthropic({
      apiKey: c.env.ANTHROPIC_API_KEY,
    });

    const model = "claude-3-5-sonnet-latest";
    modelUsed = model;

    const message = await anthropic.messages.create({
      max_tokens: 1024,
      model,
      messages: [{ role: "user", content: question }],
      system: [systemPrompt, notes.length ? contextMessage : ""].join(" "),
    });

    response = {
      response: message.content.map((content) => content.text).join("\n"),
    };
  } else {
    const model = "@cf/meta/llama-3.1-8b-instruct";
    modelUsed = model;

    response = await c.env.AI.run(model, {
      messages: [
        ...(notes.length ? [{ role: "system", content: contextMessage }] : []),
        { role: "system", content: systemPrompt },
        { role: "user", content: question },
      ],
    });
  }

  if (response) {
    c.header("x-model-used", modelUsed);
    return c.text(response.response);
  } else {
    return c.text("We were unable to generate output", 500);
  }
});

Finally, you need to set the ANTHROPIC_API_KEY environment variable in your Workers application. You can add it with the following command:

npx wrangler secret put ANTHROPIC_API_KEY
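For local development with wrangler dev, the same variable can instead be supplied via a .dev.vars file in the project root (keep this file out of version control):

# .dev.vars
ANTHROPIC_API_KEY=<your-api-key>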

2.9. Delete Notes and Vectors

If you no longer need a note, you can delete it from the database. When deleting a note, you also need to delete the corresponding vector in Vectorize. Implement the DELETE /notes/:id route in src/index.js:

app.delete("/notes/:id", async (c) => {
  const { id } = c.req.param();

  const query = `DELETE FROM notes WHERE id = ?`;
  await c.env.DB.prepare(query).bind(id).run();

  await c.env.VECTOR_INDEX.deleteByIds([id]);

  // Return an empty 204 No Content response
  return c.body(null, 204);
});
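For example, assuming the example URL from earlier, the note with id 1 (and its vector) can be removed with:

curl -X DELETE https://example.username.workers.dev/notes/1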

2.10. Text Chunking (Optional)

For large blocks of text, it's recommended to split them into smaller chunks. This helps the LLM aggregate relevant context more effectively without having to retrieve overly large text.

To implement this functionality, add the NPM package @langchain/textsplitters to your project:

npm i @langchain/textsplitters

The RecursiveCharacterTextSplitter class in this package can split text into smaller chunks. You can customize the configuration as needed; the default configuration is sufficient for most scenarios:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const text = "Some long piece of text...";

const splitter = new RecursiveCharacterTextSplitter({
  // These can be customized to change the chunking size
  // chunkSize: 1000,
  // chunkOverlap: 200,
});

const output = await splitter.createDocuments([text]);
console.log(output); // [{ pageContent: 'Some long piece of text...' }]

To use this splitter, we'll update the workflow to split text into multiple chunks, then iterate through each chunk and execute subsequent workflow steps for each:

export class RAGWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    const env = this.env;
    const { text } = event.payload;
    let texts = await step.do("split text", async () => {
      const splitter = new RecursiveCharacterTextSplitter();
      const output = await splitter.createDocuments([text]);
      return output.map((doc) => doc.pageContent);
    });

    console.log(
      `RecursiveCharacterTextSplitter generated ${texts.length} chunks`,
    );

    for (const index in texts) {
      const text = texts[index];
      const record = await step.do(
        `create database record: ${index}/${texts.length}`,
        async () => {
          const query = "INSERT INTO notes (text) VALUES (?) RETURNING *";

          const { results } = await env.DB.prepare(query).bind(text).run();

          const record = results[0];
          if (!record) throw new Error("Failed to create note");
          return record;
        },
      );

      const embedding = await step.do(
        `generate embedding: ${index}/${texts.length}`,
        async () => {
          const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
            text: text,
          });
          const values = embeddings.data[0];
          if (!values) throw new Error("Failed to generate vector embedding");
          return values;
        },
      );

      await step.do(`insert vector: ${index}/${texts.length}`, async () => {
        return env.VECTOR_INDEX.upsert([
          {
            id: record.id.toString(),
            values: embedding,
          },
        ]);
      });
    }
  }
}

Now when large blocks of text are submitted to the /notes endpoint, they will be split into smaller chunks and processed separately by the workflow.

2.11. Deploy Your Project

If you haven't deployed your Worker yet (see section 2.3), use Wrangler to deploy it to a *.workers.dev subdomain or to a custom domain you've configured. If you haven't set one up yet, Wrangler will prompt you to do so during deployment.

npx wrangler deploy

You can preview your Worker at <YOUR_WORKER>.<YOUR_SUBDOMAIN>.workers.dev.

Note

When first pushing to a *.workers.dev subdomain, you may see 523 errors during DNS propagation. These errors usually resolve themselves within about a minute.

3. Test Results

3.1 Add a Note
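Using the example *.workers.dev URL from earlier (substitute your own Worker's URL), a note can be added with a POST request to the /notes endpoint:

curl -X POST https://example.username.workers.dev/notes \
  -H "Content-Type: application/json" \
  -d '{"text": "Cloudflare Workers run on V8 isolates at the edge"}'

On success, the route returns "Created note" with status 201, and the workflow processes the text in the background.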

3.2 Workflow Execution
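The workflow run can be observed in the Cloudflare dashboard under Workflows. If you prefer the command line, recent Wrangler releases also include workflows subcommands (check availability in your Wrangler version):

npx wrangler workflows instances list rag
npx wrangler workflows instances describe rag latest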

3.3 Query the Knowledge Base
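With at least one note stored, querying the root route with a text parameter should return an answer grounded in the knowledge base:

curl "https://example.username.workers.dev/?text=What%20is%20the%20best%20pizza%20topping%3F"

Given the pepperoni note added earlier, the plain-text response should mention pepperoni.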

4. Conclusion

With this, we've completed a simple personal RAG application, which is one way to combine a knowledge base with AI.

All the code can be found on GitHub: demo-rag-ai-tutorial

Cloudflare also provides an official version: https://github.com/kristianfreeman/cloudflare-retrieval-augmented-generation-example/

Learn more: