Sidekicks Document Q&A

One of the common use cases for a Fixie Sidekick is to answer questions about a collection of documents, such as web pages or PDF files. By providing a Large Language Model with information drawn from a set of documents, the Sidekick can do an excellent job answering questions and having a conversation with the user about the contents of those documents. This approach is called Retrieval-Augmented Generation, or RAG, and is a popular way of using LLMs.

Fixie Sidekicks makes it easy to do this, as the Fixie service handles all of the details of crawling documents, building embeddings, managing a vector database, and all of the other machinery involved in developing a RAG-based application. Here, we're going to show you how to do this in Fixie.

Concepts

The fundamental concept behind RAG in Fixie is the idea of a Corpus, which represents a collection of documents that can be queried. You create a Corpus from one or more Sources, where a Source defines a set of rules for where to gather documents from, and how to process those documents when adding them to the Corpus. For example, a Web Source would define a set of URLs and crawling rules for gathering a set of web pages, while a PDF Source might have a set of PDF files that have been uploaded directly to Fixie.

Fixie deals with the process of fetching the document contents, processing them, and storing the data in a way that can later be used by the LLM running in your Sidekick to answer questions. Once a Corpus has been built, that is, populated with processed documents, you can query the Corpus to get back the most relevant information from the documents in the Corpus to that query. This information can then be fed back into the LLM to generate a final response to the query.

How RAG works in Fixie

While you don't need to fully understand the details of how Fixie implements RAG, it is helpful to understand the process at a high level.

In Fixie, you create a Corpus and add one or more Sources to it. Once a Source has been added to a Corpus, Fixie takes care of fetching the documents from that Source (which might involve crawling web pages and following links). Documents are then processed in various ways -- for example, HTML contents are converted to Markdown, and raw text is extracted from PDF files.

Next, the processed documents are converted into a set of Chunks. Each Chunk represents a small piece of the document contents -- usually a few kilobytes in size -- that the Large Language Model can process independently of the rest of the document. Fixie then generates a vector embedding for each Chunk. Think of an embedding as a numerical representation of the contents of the Chunk: it is produced by feeding the Chunk through a Large Language Model and extracting the vector representation of the model's hidden state. The contents of each Chunk, along with its embedding, are stored in a vector database, which makes it possible to later fetch the set of Chunks that are most relevant to a user query.
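To make the chunking step concrete, here is a toy chunker that splits text into overlapping, fixed-size pieces. This is illustrative only: the chunk size, overlap, and splitting logic are assumptions for the example, and Fixie's real chunker is more sophisticated (for instance, it can respect sentence and document structure).

```typescript
// Illustrative only: a toy chunker. Fixie's actual chunking logic differs.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    // Take a fixed-size window; the last chunk may be shorter.
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward, keeping `overlap` characters of shared context
    // so that no sentence is split cleanly between two chunks.
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Overlapping chunks are a common RAG trick: they reduce the chance that the answer to a query straddles a chunk boundary and is lost.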

Say we have stored a set of Chunks in the vector database representing pieces of documents about foxes. One Chunk might contain:

There are 37 species of foxes, but only 12 belong to the genus “vulpes” or “true foxes”. The best known of them are: Red Fox, Arctic Fox, Kit Fox, Fennec Fox, and the Gray Fox.

and another Chunk might contain:

The distribution areas for the fox are quite remarkable, and that is why they have a legacy out there like no other animal has. They are found throughout areas of North America.

If the user were to query the Corpus with "What are the various species of foxes?", Fixie first generates the vector embedding for the query itself. It then uses this query vector to find the Chunks in the vector database that are semantically "closest" to the user's query. In the case of this particular query, the first Chunk above would be ranked higher than the second Chunk. The Corpus query result is a list of Chunks, ranked in order from most to least relevant to the provided query.
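Under the hood, "semantically closest" typically means highest cosine similarity between embedding vectors. Here is a minimal sketch of that ranking step; the two-dimensional embeddings are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the vector database does this far more efficiently than a full sort.

```typescript
// Illustrative only: rank chunks by cosine similarity to a query embedding.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Chunk {
  content: string;
  embedding: number[];
}

// Return chunks ordered from most to least similar to the query.
function rankChunks(queryEmbedding: number[], chunks: Chunk[]): Chunk[] {
  return [...chunks].sort(
    (x, y) =>
      cosineSimilarity(queryEmbedding, y.embedding) -
      cosineSimilarity(queryEmbedding, x.embedding)
  );
}
```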

However, just getting back the Chunks, by themselves, is not very useful. Usually, you will want to pass the Chunks, along with the user query, back through the Large Language Model to provide a final answer to the user's question. The LLM can then generate a response using both its internal knowledge as well as that drawn from the data provided in the Chunks.
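A minimal sketch of that final step, assuming a simple hand-rolled prompt format (Fixie and AI.JSX handle this internally, and their actual prompt wording will differ):

```typescript
// Illustrative only: combine retrieved chunks and the user's question
// into a single prompt string for an LLM. The wording here is our own.
function buildRagPrompt(question: string, chunks: string[]): string {
  // Number each chunk so the model (or a human) can see where facts came from.
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join('\n\n');
  return (
    'Answer the question using the context below. ' +
    'If the context is not sufficient, say so.\n\n' +
    `Context:\n${context}\n\nQuestion: ${question}\nAnswer:`
  );
}
```

The key design point is that the LLM sees both the question and the retrieved evidence in one prompt, which is what lets it ground its answer in the documents rather than relying solely on its training data.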

Don't worry! All of the steps above are handled automatically for you by the Fixie platform and the Sidekicks SDK, as we'll explain below.

Creating a Corpus

The first step is to create an empty Corpus. You can do this on the Fixie Console by navigating to "Documents" and clicking on "New Document Collection". You can also do this via the Fixie CLI:

$ npx fixie corpus create "My test corpus"

This will give you back a Corpus ID, which you will need for the later steps, so jot it down:

{
  "corpus": {
    "corpusId": "66cd8b74-155f-45c8-83ed-28814ae7be89",
    "displayName": "My test corpus",
    "created": "2023-09-15T23:55:06.863527Z",
    "modified": "2023-09-15T23:55:06.891968Z",
    "stats": {
      "status": "CORPUS_STATUS_EMPTY"
    }
  }
}

You can use npx fixie corpus list to see the list of corpora you have created.

Example corpus

Rather than creating a corpus from scratch, you can also query the existing corpus with ID 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f from your own code. This corpus contains a bunch of web pages with exciting facts about foxes, so it makes for a good place to test out the query functionality!

Adding Sources to the Corpus

The next step is to add one or more Sources to the Corpus. Again, you can do this via the Fixie web UI, or using the CLI:

$ npx fixie corpus sources add 66cd8b74-155f-45c8-83ed-28814ae7be89 https://en.wikipedia.org/wiki/Fox

Here, the string 66cd8b74... is the corpus ID returned when you first created the Corpus.

This adds a single web page Source to the Corpus. A Corpus can have multiple Sources, or a single Source can have multiple web pages -- this is just a simple example. See the Fixie API Documentation for more details on the underlying APIs.

The above command creates a Source with one web page. If you wanted to include more documents -- say, all of the pages linked to by the above Wikipedia page, but from the same domain -- you could use:

$ npx fixie corpus sources add \
66cd8b74-155f-45c8-83ed-28814ae7be89 \
https://en.wikipedia.org/wiki/Fox \
--max-depth 1 \
--include-patterns 'https://en.wikipedia.org/**'

Here, we're indicating that the Source should include only web pages within one "hop" from the original page, and only with URLs that match the glob pattern provided on the command line.
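To make the pattern matching concrete, here is a rough sketch of how a glob like https://en.wikipedia.org/** can act as a URL filter during a crawl. This is an approximation for illustration only; Fixie's actual glob semantics may differ in edge cases.

```typescript
// Illustrative only: convert a simple glob into a RegExp URL filter.
// Assumed semantics: '**' matches any characters (including '/'),
// while a single '*' stops at a '/' boundary.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving '*' for glob handling below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000') // placeholder so '*' handling skips '**'
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${pattern}$`);
}

const include = globToRegExp('https://en.wikipedia.org/**');
```

With this filter, a crawler would keep https://en.wikipedia.org/wiki/Fox but discard links to other domains.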

You could, in principle, create a Source which crawls, say, the entire web -- but Fixie imposes a default cap on the maximum number of documents that can be included in a single Source, currently set to 200.

Checking the status of the Corpus

Be patient!

Note that building a Corpus from a Source can take a while, depending on the number of documents, web pages, etc. defined by the Source. Processing a single web page or document might take only a couple of minutes, but a large web crawl could take many hours.

You can check on the progress of your Corpus build by running npx fixie corpus get <corpus-id>, like so:

❯ npx fixie corpus get 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f
{
  "corpus": {
    "corpusId": "44094d5a-f817-4c2e-a2a4-8f8a0c936d0f",
    "displayName": "FoxesWorld.com",
    "created": "2023-09-15T20:15:45.549089Z",
    "modified": "2023-09-15T20:15:45.581481Z",
    "stats": {
      "status": "CORPUS_STATUS_READY",
      "lastUpdated": "2023-09-15T21:11:53.788270Z",
      "numChunks": "762"
    }
  }
}

This shows that the corpus status is CORPUS_STATUS_READY and has 762 chunks ready to be queried!

Understanding the API

You can check out the Fixie API Documentation at docs.fixie.ai, which describes each of the fields in the API responses in detail.

Querying the corpus

You can run a manual corpus query on the Fixie Console by navigating to the page for the Corpus you want to test and selecting the Query Test tab. Alternatively, you can use the npx fixie corpus query command:

❯ npx fixie corpus query 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f "What is a fennec fox?"
{
  "results": [
    {
      "chunkContent": "Fennec Fox - Fox Facts and Information ...",
      "score": 0.897228,
      "sourceId": "29c0999a-821b-4888-b838-49d7c028a667",
      "documentId": "6aa96dd7-5bdb-468b-b7f6-04702246ebd5"
    }
  ]
}

The query operation returns the set of chunks that are most relevant to the query, according to the vector database. The chunkContent field contains the contents of the chunk, score is the similarity measure, and the sourceId and documentId fields identify the source and document that the chunk came from.

Typically, the chunks returned from a query will be used in a subsequent call to an LLM to generate a final response to a question.
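For example, here is a small helper that pulls out the highest-scoring chunk contents from a query response like the one shown above, so they can be handed to an LLM. The field names mirror the CLI output; the helper itself is hypothetical and not part of the Fixie SDK.

```typescript
// Illustrative only: the result shape matches the CLI's JSON output.
interface QueryResult {
  chunkContent: string;
  score: number;
  sourceId: string;
  documentId: string;
}

// Return the contents of the top-scoring chunks, best first.
function topChunkContents(results: QueryResult[], limit = 3): string[] {
  return [...results]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((r) => r.chunkContent);
}
```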

Using Documents in your Sidekick

Okay! Now that we have the fundamentals out of the way, we can show you how to add the ability to query a corpus of documents from your Sidekick. Fortunately, it's very easy, once you have the corpus built.

Step 1: Create a Sidekick

You've already done this, we hope, as part of the Quickstart.

Step 2: Create a FixieCorpus

You can directly query a corpus from an AI.JSX app using the FixieCorpus class, and use the <DocsQA> element in AI.JSX to answer a question from the results of a corpus query. Here is a simple example of a function that returns an AI.JSX component that queries a given corpus:

import { DocsQA, FixieCorpus } from 'ai-jsx/batteries/docs';

function DoCorpusQuery() {
  const corpus = new FixieCorpus('44094d5a-f817-4c2e-a2a4-8f8a0c936d0f');
  const question = 'What is a Fennec fox?';
  return <DocsQA question={question} corpus={corpus} chunkLimit={5} />;
}

The fixie-sidekick-template example uses a slightly different approach, in which the FixieCorpus.createTool() method is used to provide a Tool to the AI.JSX <Sidekick> component to query the corpus. This is functionally the same as the use of <DocsQA> above.