[Part 1] Building a Low-Cost RAG Personal Knowledge Base with Cloudflare
1. Introduction
The generous Cloudflare strikes again! Not only do they offer free AI model invocations, but they also teach us how to build our own RAG (Retrieval-Augmented Generation).
RAG is a technical architecture that combines information retrieval with large language model generation, enabling AI to:
- First retrieve relevant information from an external knowledge base
- Then generate answers based on the retrieved content
And all of this can be achieved through Cloudflare with just a Workers Paid plan ($5/month) 😹, which also includes a bunch of other features not detailed here.
The following content is based on this Cloudflare blog post. If you're comfortable with English, I recommend reading the original.
Build a Retrieval Augmented Generation (RAG) AI
1.1 Want to See the Code First?
All the code can be found on GitHub: demo-rag-ai-tutorial
1.2 Live Demo
There's also a live demo page where you can try it out: https://demo-rag.runnable.run/ui
2. Building a Retrieval Augmented Generation (RAG) AI · Cloudflare Workers AI Documentation
This guide will walk you through setting up and deploying your first application using Cloudflare AI. You'll build a fully functional AI application using Workers AI, Vectorize, D1, and Cloudflare Workers.
AI Search provides a fully managed way to build RAG pipelines on Cloudflare, covering data ingestion, indexing, and querying out of the box.
After completing this tutorial, you'll have an AI tool that can store information and query it using a large language model. This pattern is called Retrieval Augmented Generation (RAG), and you can implement it by combining multiple capabilities from the Cloudflare AI toolkit. No prior experience with AI tools is required to build this application.
- Sign up for a Cloudflare account.
- Install Node.js.
You'll also need access to Vectorize. This tutorial also demonstrates optional integration with Anthropic Claude. If you want to use it, you'll need an Anthropic API Key.
2.1. Create a New Worker Project
C3 (create-cloudflare-cli) is a command-line tool designed to help you quickly create and deploy Workers on Cloudflare.
Open your terminal and run C3 to create your Worker project:
- npm: npm create cloudflare@latest -- rag-ai-tutorial
- yarn: yarn create cloudflare rag-ai-tutorial
- pnpm: pnpm create cloudflare@latest rag-ai-tutorial
In the interactive setup, select the following options:
- For "What would you like to start with?", choose
Hello World example - For "Which template would you like to use?", choose
Worker only - For "Which language do you want to use?", choose
JavaScript - For "Do you want to use git for version control?", choose
Yes - For "Do you want to deploy your application?", choose
No(we'll make some changes before deploying)
In your project directory, C3 will generate several files.
What files did C3 create?
- wrangler.jsonc: Your Wrangler configuration file
- worker.js (in /src): A minimal "Hello World!" Worker written using ES Module syntax
- package.json: Minimal Node dependencies configuration file
- package-lock.json: See npm documentation on package-lock.json
- node_modules: See npm documentation on node_modules
Then navigate to your newly created project directory:
cd rag-ai-tutorial
2.2. Develop with the Wrangler CLI
The Workers command-line tool Wrangler allows you to create, develop locally, and deploy Workers projects. C3 installs Wrangler by default for your project.
After creating your first Worker, run wrangler dev in your project directory to start a local development server for testing your changes.
npx wrangler dev
Now visit http://localhost:8787 to see your Worker running. Any changes to your code will trigger a rebuild, and refreshing the page will show the latest output.
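If you prefer the terminal, you can check the same endpoint with curl:

curl http://localhost:8787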
2.3. Add AI Binding
To use Cloudflare's AI products, you can add an ai block as a remote binding in your Wrangler configuration file. This sets up a binding to Cloudflare AI models in your code, allowing you to interact with available models on the platform.
If you haven't used Wrangler before, it will try to open a browser for you to log in with your Cloudflare account.
If you encounter issues at this step or cannot use the browser interface, refer to the wrangler login documentation.
This example uses the @cf/meta/llama-3-8b-instruct model, which is used for text generation.
wrangler.jsonc:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "ai": {
    "binding": "AI",
    "remote": true
  }
}

wrangler.toml:

[ai]
binding = "AI"
remote = true
Next, find the src/index.js file. Inside the fetch handler, you can make model calls through the AI binding:
export default {
async fetch(request, env, ctx) {
const answer = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
messages: [{ role: "user", content: `What is the square root of 9?` }],
});
return new Response(JSON.stringify(answer));
},
};
By calling the LLM through the AI binding, we can interact directly with Cloudflare AI's large language models in our code.
You can deploy your Worker using wrangler:
npx wrangler deploy
Making a request to your Worker returns the LLM-generated text as a JSON object.
curl https://example.username.workers.dev
{"response":"Answer: The square root of 9 is 3."}
2.4. Add Vector Embeddings with Cloudflare D1 and Vectorize
Embeddings allow you to add retrieval capabilities to language models in your Cloudflare AI projects. This is achieved through Vectorize, Cloudflare's vector database.
To get started with Vectorize, first create an embedding index using wrangler. This index stores 768-dimensional vectors and uses cosine similarity to determine similarity between vectors:
npx wrangler vectorize create vector-index --dimensions=768 --metric=cosine
Then add the configuration for your new Vectorize index to your Wrangler configuration file:
wrangler.jsonc:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "vectorize": [
    {
      "binding": "VECTOR_INDEX",
      "index_name": "vector-index"
    }
  ]
}

wrangler.toml:

# ... existing wrangler configuration
[[vectorize]]
binding = "VECTOR_INDEX"
index_name = "vector-index"
A vector index is used to store a set of dimensions (floating-point numbers) that represent your data. When you need to query, you also convert your query into a vector. Vectorize is designed to efficiently find stored vectors that are most similar to your query.
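As a rough sketch of that flow (upsert and query are the Vectorize binding methods used later in this tutorial; the vector values below are made-up placeholders, not real embeddings):

// Store a vector under an ID; in this tutorial the values are 768 floats from the embedding model.
await env.VECTOR_INDEX.upsert([{ id: "1", values: [0.12, -0.53 /* ...766 more floats... */] }]);

// Query with another 768-dimensional vector (e.g. the embedding of a user question);
// matches come back ranked by cosine similarity.
const queryVector = [0.11, -0.5 /* ...766 more floats... */];
const result = await env.VECTOR_INDEX.query(queryVector, { topK: 1 });
// result.matches -> e.g. [{ id: "1", score: 0.97 }]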
To implement search functionality, you need to set up a Cloudflare D1 database. You can store application data in D1, convert it to vector format, and when a user searches and matches a vector, retrieve and display the corresponding data.
Create a new D1 database using wrangler:
npx wrangler d1 create database
Then paste the configuration output from the previous command into your Wrangler configuration file:
wrangler.jsonc:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "d1_databases": [
    {
      "binding": "DB",
      "database_name": "database",
      "database_id": "abc-def-geh"
    }
  ]
}

wrangler.toml:

# ... existing wrangler configuration
[[d1_databases]]
binding = "DB" # available in your Worker on env.DB
database_name = "database"
database_id = "abc-def-geh" # replace this with a real database_id (UUID)
In this application, we'll create a notes table in D1 to store the note text; the matching vectors live in Vectorize, and a vector match leads us back to the note's row in D1. Run the following SQL to create the table:
npx wrangler d1 execute database --remote --command "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT NOT NULL)"
Now use wrangler d1 execute to add a note to the database:
npx wrangler d1 execute database --remote --command "INSERT INTO notes (text) VALUES ('The best pizza topping is pepperoni')"
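To confirm the row was written, you can read it back with the same command (optional):

npx wrangler d1 execute database --remote --command "SELECT * FROM notes"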
2.5. Create a Workflow
Before we start creating notes, let's introduce Cloudflare Workflows. A Workflow lets us define a durable, multi-step process that reliably executes every step of the RAG ingestion pipeline.
First, add a new [[workflows]] configuration block to your Wrangler configuration file:
wrangler.jsonc:

{
  "$schema": "./node_modules/wrangler/config-schema.json",
  "workflows": [
    {
      "name": "rag",
      "binding": "RAG_WORKFLOW",
      "class_name": "RAGWorkflow"
    }
  ]
}

wrangler.toml:

# ... existing wrangler configuration
[[workflows]]
name = "rag"
binding = "RAG_WORKFLOW"
class_name = "RAGWorkflow"
Add a class named RAGWorkflow in src/index.js that extends WorkflowEntrypoint:
import { WorkflowEntrypoint } from "cloudflare:workers";
export class RAGWorkflow extends WorkflowEntrypoint {
async run(event, step) {
await step.do("example step", async () => {
console.log("Hello World!");
});
}
}
This class defines a simple workflow step that outputs "Hello World!" to the console. You can add more steps as needed.
The workflow itself doesn't execute automatically. To trigger it, you need to call the RAG_WORKFLOW binding and pass in the parameters required by the workflow. Here's an example call:
env.RAG_WORKFLOW.create({ params: { text } });
2.6. Create Notes and Write to Vectorize
To allow the Worker to handle multiple routes, we'll introduce Hono, a lightweight routing framework for Workers, and use it to create a route for adding notes. Install hono with your package manager:
- npm: npm i hono
- yarn: yarn add hono
- pnpm: pnpm add hono
Then import hono in src/index.js and use it to refactor the fetch handler:
import { Hono } from "hono";
const app = new Hono();
app.get("/", async (c) => {
const answer = await c.env.AI.run("@cf/meta/llama-3-8b-instruct", {
messages: [{ role: "user", content: `What is the square root of 9?` }],
});
return c.json(answer);
});
export default app;
This creates a route at the root path / with functionality equivalent to the previous version of the application.
Next, update the workflow to write notes to the database and generate corresponding embedding vectors.
This example uses the @cf/baai/bge-base-en-v1.5 model for generating embeddings. Embeddings are stored in Vectorize and can be retrieved. User queries are also converted to embeddings for searching in Vectorize.
import { WorkflowEntrypoint } from "cloudflare:workers";
export class RAGWorkflow extends WorkflowEntrypoint {
async run(event, step) {
const env = this.env;
const { text } = event.payload;
const record = await step.do(`create database record`, async () => {
const query = "INSERT INTO notes (text) VALUES (?) RETURNING *";
const { results } = await env.DB.prepare(query).bind(text).run();
const record = results[0];
if (!record) throw new Error("Failed to create note");
return record;
});
const embedding = await step.do(`generate embedding`, async () => {
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: text,
});
const values = embeddings.data[0];
if (!values) throw new Error("Failed to generate vector embedding");
return values;
});
await step.do(`insert vector`, async () => {
return env.VECTOR_INDEX.upsert([
{
id: record.id.toString(),
values: embedding,
},
]);
});
}
}
This workflow performs the following steps:
- Receives the text parameter
- Inserts a new record into the D1 notes table and gets the new row's id
- Converts text to a vector using the embedding model
- Upserts the id and vector into the Vectorize vector-index index
This creates a vector representation for the note that can be used for retrieval later.
To complete the functionality, we add a route that allows users to submit notes to the database. This route parses the JSON request body, gets the text parameter, and creates a workflow instance with that parameter:
app.post("/notes", async (c) => {
const { text } = await c.req.json();
if (!text) return c.text("Missing text", 400);
await c.env.RAG_WORKFLOW.create({ params: { text } });
return c.text("Created note", 201);
});
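With the Worker running (locally via wrangler dev, or deployed), a note can be created with a request like this (the note text is just an example):

curl -X POST http://localhost:8787/notes \
  -H "Content-Type: application/json" \
  -d '{"text": "The best pizza topping is pepperoni"}'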
2.7. Query Vectorize to Retrieve Notes
To complete the code, you can perform a Vectorize query at the root path (/). First convert the question to a vector, then use the vector-index index to find the most similar vectors.
The topK parameter limits the number of vectors returned. For example, with topK = 1, only the "most similar" vector based on the query is returned; setting it to 5 returns the 5 most similar vectors.
Based on the list of similar vectors, you can retrieve notes corresponding to the record IDs stored with those vectors. Here we only retrieve one note; you can customize as needed.
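As a minimal sketch of that customization, here is what retrieving several matches could look like with topK raised to 5 (this variation is not part of the tutorial code; it reuses the DB and VECTOR_INDEX bindings defined above):

const vectorQuery = await c.env.VECTOR_INDEX.query(vectors, { topK: 5 });
const ids = vectorQuery.matches.map((match) => match.id);
// Build a "?, ?, ?" placeholder list so each ID is bound as a parameter.
const placeholders = ids.map(() => "?").join(", ");
const { results } = await c.env.DB.prepare(
  `SELECT * FROM notes WHERE id IN (${placeholders})`,
).bind(...ids).run();
const notes = results.map((row) => row.text);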
You can insert the text of these notes as context into the LLM's prompt. This is the foundation of Retrieval Augmented Generation (RAG): providing data outside the LLM as additional context to enhance its generation quality.
We'll update the prompt to include context and ask the LLM to use this context when answering:
import { Hono } from "hono";
const app = new Hono();
// Existing post route...
// app.post('/notes', async (c) => { ... })
app.get("/", async (c) => {
const question = c.req.query("text") || "What is the square root of 9?";
const embeddings = await c.env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: question,
});
const vectors = embeddings.data[0];
const vectorQuery = await c.env.VECTOR_INDEX.query(vectors, { topK: 1 });
let vecId;
if (
vectorQuery.matches &&
vectorQuery.matches.length > 0 &&
vectorQuery.matches[0]
) {
vecId = vectorQuery.matches[0].id;
} else {
console.log("No matching vector found or vectorQuery.matches is empty");
}
let notes = [];
if (vecId) {
const query = `SELECT * FROM notes WHERE id = ?`;
const { results } = await c.env.DB.prepare(query).bind(vecId).run();
if (results) notes = results.map((vec) => vec.text);
}
const contextMessage = notes.length
? `Context:\n${notes.map((note) => `- ${note}`).join("\n")}`
: "";
const systemPrompt = `When answering the question or responding, use the context provided, if it is provided and relevant.`;
const { response: answer } = await c.env.AI.run(
"@cf/meta/llama-3-8b-instruct",
{
messages: [
...(notes.length ? [{ role: "system", content: contextMessage }] : []),
{ role: "system", content: systemPrompt },
{ role: "user", content: question },
],
},
);
return c.text(answer);
});
app.onError((err, c) => {
  // c.text expects a string, so stringify the error before returning it.
  return c.text(String(err), 500);
});
export default app;
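After deploying (or with wrangler dev running), you can then ask a question against the stored notes, for example:

curl "http://localhost:8787/?text=What%20is%20the%20best%20pizza%20topping%3F"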
2.8. Integrate Anthropic Claude Model (Optional)
If you need to handle larger documents, you can optionally use Anthropic's Claude model, which has a larger context window and is well-suited for RAG workflows.
First install the @anthropic-ai/sdk package:
- npm: npm i @anthropic-ai/sdk
- yarn: yarn add @anthropic-ai/sdk
- pnpm: pnpm add @anthropic-ai/sdk
In src/index.js, you can update the GET / route to check for the ANTHROPIC_API_KEY environment variable: if it exists, use the Anthropic SDK to generate text; otherwise fall back to the current Workers AI code:
import Anthropic from "@anthropic-ai/sdk";

app.get("/", async (c) => {
  // ... Existing code
  const systemPrompt = `When answering the question or responding, use the context provided, if it is provided and relevant.`;
  let modelUsed = "";
  let response = null;
  if (c.env.ANTHROPIC_API_KEY) {
    const anthropic = new Anthropic({
      apiKey: c.env.ANTHROPIC_API_KEY,
    });
    const model = "claude-3-5-sonnet-latest";
    modelUsed = model;
    const message = await anthropic.messages.create({
      max_tokens: 1024,
      model,
      messages: [{ role: "user", content: question }],
      system: [systemPrompt, notes.length ? contextMessage : ""].join(" "),
    });
    response = {
      response: message.content.map((content) => content.text).join("\n"),
    };
  } else {
    const model = "@cf/meta/llama-3.1-8b-instruct";
    modelUsed = model;
    response = await c.env.AI.run(model, {
      messages: [
        ...(notes.length ? [{ role: "system", content: contextMessage }] : []),
        { role: "system", content: systemPrompt },
        { role: "user", content: question },
      ],
    });
  }
  if (response) {
    c.header("x-model-used", modelUsed);
    return c.text(response.response);
  } else {
    return c.text("We were unable to generate output", 500);
  }
});
Finally, you need to set the ANTHROPIC_API_KEY environment variable in your Workers application. You can add it with the following command:
npx wrangler secret put ANTHROPIC_API_KEY
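For local development with wrangler dev, the same variable can instead go in a .dev.vars file at the project root, which Wrangler loads as a local secret (replace the value with your own key, and keep this file out of version control):

ANTHROPIC_API_KEY=your-api-key-here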
2.9. Delete Notes and Vectors
If you no longer need a note, you can delete it from the database. When deleting a note, you also need to delete the corresponding vector in Vectorize. Implement the DELETE /notes/:id route in src/index.js:
app.delete("/notes/:id", async (c) => {
  const { id } = c.req.param();
  const query = `DELETE FROM notes WHERE id = ?`;
  await c.env.DB.prepare(query).bind(id).run();
  await c.env.VECTOR_INDEX.deleteByIds([id]);
  // Return an empty 204 response; c.status() alone does not produce a Response in Hono.
  return c.body(null, 204);
});
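For example, assuming a note with id 1 exists:

curl -X DELETE http://localhost:8787/notes/1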
2.10. Text Chunking (Optional)
For large blocks of text, it's recommended to split them into smaller chunks. This helps the LLM aggregate relevant context more effectively without having to retrieve overly large text.
To implement this functionality, add the NPM package @langchain/textsplitters to your project:
- npm: npm i @langchain/textsplitters
- yarn: yarn add @langchain/textsplitters
- pnpm: pnpm add @langchain/textsplitters
The RecursiveCharacterTextSplitter class in this package can split text into smaller chunks. You can customize the configuration as needed; the default configuration is sufficient for most scenarios:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
const text = "Some long piece of text...";
const splitter = new RecursiveCharacterTextSplitter({
// These can be customized to change the chunking size
// chunkSize: 1000,
// chunkOverlap: 200,
});
const output = await splitter.createDocuments([text]);
console.log(output); // [{ pageContent: 'Some long piece of text...' }]
To use this splitter, we'll update the workflow to split text into multiple chunks, then iterate through each chunk and execute subsequent workflow steps for each:
import { WorkflowEntrypoint } from "cloudflare:workers";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

export class RAGWorkflow extends WorkflowEntrypoint {
async run(event, step) {
const env = this.env;
const { text } = event.payload;
let texts = await step.do("split text", async () => {
const splitter = new RecursiveCharacterTextSplitter();
const output = await splitter.createDocuments([text]);
return output.map((doc) => doc.pageContent);
});
    console.log(
      `RecursiveCharacterTextSplitter generated ${texts.length} chunks`,
    );
for (const index in texts) {
const text = texts[index];
const record = await step.do(
`create database record: ${index}/${texts.length}`,
async () => {
const query = "INSERT INTO notes (text) VALUES (?) RETURNING *";
const { results } = await env.DB.prepare(query).bind(text).run();
const record = results[0];
if (!record) throw new Error("Failed to create note");
return record;
},
);
const embedding = await step.do(
`generate embedding: ${index}/${texts.length}`,
async () => {
const embeddings = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
text: text,
});
const values = embeddings.data[0];
if (!values) throw new Error("Failed to generate vector embedding");
return values;
},
);
await step.do(`insert vector: ${index}/${texts.length}`, async () => {
return env.VECTOR_INDEX.upsert([
{
id: record.id.toString(),
values: embedding,
},
]);
});
}
}
}
Now when large blocks of text are submitted to the /notes endpoint, they will be split into smaller chunks and processed separately by the workflow.
2.11. Deploy Your Project
If you haven't deployed your Worker yet, use Wrangler to deploy it to a *.workers.dev subdomain or to a custom domain you've configured. If you haven't configured a subdomain or domain yet, Wrangler will prompt you to set one up during deployment.
npx wrangler deploy
You can preview your Worker at <YOUR_WORKER>.<YOUR_SUBDOMAIN>.workers.dev.
Note
When first pushing to a *.workers.dev subdomain, you may see 523 errors during DNS propagation. These errors usually resolve themselves within about a minute.
3. Test Results
3.1 Add a Note
(screenshot)
3.2 Workflow Execution
(screenshot)
3.3 Query the Knowledge Base
(screenshot)
4. Conclusion
With this, we've completed a simple personal RAG application, which is one way to combine a knowledge base with AI.
5. Related Resources
All the code can be found on GitHub: demo-rag-ai-tutorial
Cloudflare also provides an official version: https://github.com/kristianfreeman/cloudflare-retrieval-augmented-generation-example/
Learn more:
- Check out the RAG Architecture Reference Diagram
- Read Cloudflare's AI Documentation
- Browse Tutorials to build projects on Workers
- Try Examples to copy and paste code for experimentation
- Learn how Workers work in the Reference
- Learn about Workers features and capabilities in Platform
- Set up Wrangler to programmatically create, test, and deploy your Worker projects