Sidekicks Document Q&A
One of the common use cases for a Fixie Sidekick is to answer questions about a collection of documents, such as web pages or PDF files. By providing a Large Language Model with information drawn from a set of documents, the Sidekick can do an excellent job answering questions and having a conversation with the user about the contents of those documents. This approach is called Retrieval-Augmented Generation, or RAG, and is a popular way of using LLMs.
Fixie Sidekicks makes it easy to do this, as the Fixie service handles the details of crawling documents, building embeddings, managing a vector database, and all of the other machinery involved in developing a RAG-based application. Here, we're going to show you how to do this in Fixie.
Concepts
The fundamental concept behind RAG in Fixie is the idea of a Corpus, which represents a collection of documents that can be queried. You create a Corpus from one or more Sources, where a Source defines a set of rules for where to gather documents from, and how to process those documents when adding them to the Corpus. For example, a Web Source would define a set of URLs and crawling rules for gathering a set of web pages, while a PDF Source might have a set of PDF files that have been uploaded directly to Fixie.
Fixie deals with the process of fetching the document contents, processing them, and storing the data in a way that can later be used by the LLM running in your Sidekick to answer questions. Once a Corpus has been built, that is, populated with processed documents, you can query the Corpus to get back the most relevant information from the documents in the Corpus to that query. This information can then be fed back into the LLM to generate a final response to the query.
How RAG works in Fixie
While you don't need to fully understand the details of how Fixie implements RAG, it is helpful to understand the process at a high level.
In Fixie, you create a Corpus and add one or more Sources to it. Once a Source has been added to a Corpus, Fixie takes care of fetching the documents from that Source (which might involve crawling web pages and following links). Documents are then processed in various ways -- for example, HTML contents are converted to Markdown, and raw text is extracted from PDF files.
Next, the processed documents are converted into a set of Chunks. Each Chunk represents a small piece of the document contents -- usually a few kilobytes in size -- that the Large Language Model can process independently of the rest of the document. Fixie uses an LLM to generate a vector embedding for each Chunk. Think of an embedding as a numerical representation of the contents of the Chunk. Embeddings are generated by feeding the Chunk through a Large Language Model and extracting the vector representation of the hidden state of the model. The contents of each Chunk, along with its embedding, are stored in a vector database, which makes it possible to later fetch the set of Chunks that are most relevant to a user query.
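To make the chunking step concrete, here is a minimal sketch of how a document could be split into size-bounded pieces. This is not Fixie's actual chunker, which the text above does not describe; the chunkText helper and its maxChars parameter are hypothetical names used for illustration only.

```typescript
// Hypothetical helper (not part of the Fixie SDK): split text into chunks of
// at most maxChars characters, preferring to break at paragraph boundaries.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) {
    throw new Error("maxChars must be positive");
  }
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    // A single paragraph longer than maxChars is hard-split into pieces.
    for (let i = 0; i < para.length; i += maxChars) {
      const piece = para.slice(i, i + maxChars);
      const candidate = current ? current + "\n\n" + piece : piece;
      if (candidate.length <= maxChars) {
        current = candidate; // still fits, keep accumulating
      } else {
        if (current) chunks.push(current);
        current = piece; // start a new chunk
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sampleChunks = chunkText(
  "Foxes are small.\n\nThey live in dens.\n\nMany are nocturnal.",
  40
);
console.log(sampleChunks);
```

A real chunker would typically work on tokens rather than characters and might overlap adjacent chunks, but the idea is the same: each piece must be small enough to embed and query on its own.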
Suppose we have stored a set of Chunks in the vector database, representing pieces of documents about foxes. One Chunk might contain:
There are 37 species of foxes, but only 12 belong to the genus “vulpes” or “true foxes”. The best known of them are: Red Fox, Arctic Fox, Kit Fox, Fennec Fox, and the Gray Fox.
and another Chunk might contain:
The distribution areas for the fox are quite remarkable, and that is why they have a legacy out there like no other animal has. They are found throughout areas of North America.
If the user were to query the Corpus with "What are the various species of foxes?", Fixie first generates the vector embedding for the query itself. It then uses this query vector to find the Chunks in the vector database that are semantically "closest" to the user's query. In the case of this particular query, the first Chunk above would be ranked higher than the second Chunk. The Corpus query result is a list of Chunks, ranked in order from most to least relevant to the provided query.
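The ranking step can be sketched as follows. We assume cosine similarity as the "closeness" measure, which is a common choice but is not confirmed here as the one Fixie uses. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```typescript
// Illustrative types and helpers; none of these names come from the Fixie SDK.
interface EmbeddedChunk {
  id: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector, most relevant first.
function rankChunks(query: number[], chunks: EmbeddedChunk[]) {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score);
}

// Made-up vectors standing in for the two fox Chunks and the user query.
const queryVector = [0.9, 0.1, 0.2];
const ranked = rankChunks(queryVector, [
  { id: "distribution", embedding: [0.1, 0.9, 0.3] },
  { id: "species", embedding: [0.8, 0.2, 0.1] },
]);
console.log(ranked.map((r) => r.id)); // "species" ranks first
```

A vector database does the same thing at scale, using index structures that avoid comparing the query against every stored Chunk.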
However, just getting back the Chunks, by themselves, is not very useful. Usually, you will want to pass the Chunks, along with the user query, back through the Large Language Model to produce a final answer to the user's question. The LLM can then generate a response using both its internal knowledge and the data provided in the Chunks.
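As a rough illustration of this final step, the sketch below assembles a prompt from the retrieved Chunks and the user's question. The buildPrompt function and the prompt wording are assumptions made for this example; they are not the prompt that Fixie or AI.JSX actually sends to the LLM.

```typescript
// Hypothetical helper: combine retrieved chunks and the question into one prompt.
function buildPrompt(question: string, chunks: string[]): string {
  // Number each chunk so the model (and a human reader) can tell them apart.
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "Answer the question using the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const prompt = buildPrompt("What are the various species of foxes?", [
  "There are 37 species of foxes, but only 12 belong to the genus vulpes.",
  "They are found throughout areas of North America.",
]);
console.log(prompt);
```

The resulting string would then be sent to the LLM as part of a completion or chat request.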
Don't worry! All of the steps above are handled automatically for you by the Fixie platform and the Sidekicks SDK, as we'll explain below.
Creating a Corpus
The first step is to create an empty Corpus. You can do this on the Fixie Console by navigating to "Documents" and clicking on "New Document Collection". You can also do this via the Fixie CLI:
$ npx fixie corpus create "My test corpus"
This will give you back a Corpus ID, which you will need for the later steps, so jot it down:
{
"corpus": {
"corpusId": "66cd8b74-155f-45c8-83ed-28814ae7be89",
"displayName": "My test corpus",
"created": "2023-09-15T23:55:06.863527Z",
"modified": "2023-09-15T23:55:06.891968Z",
"stats": {
"status": "CORPUS_STATUS_EMPTY"
}
}
}
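If you are scripting this step, you may want to pull the Corpus ID out of the CLI output programmatically rather than copying it by hand. The sketch below assumes the CLI output is exactly the JSON shown above; the CorpusResponse interface and extractCorpusId function are names invented for this example, with fields inferred from that sample output.

```typescript
// Shape inferred from the sample output above; fields marked optional
// only appear once the corpus has been built.
interface CorpusResponse {
  corpus: {
    corpusId: string;
    displayName: string;
    created: string;
    modified: string;
    stats: { status: string; lastUpdated?: string; numChunks?: string };
  };
}

// Hypothetical helper: parse the CLI's JSON output and return the corpus ID.
function extractCorpusId(cliOutput: string): string {
  const parsed = JSON.parse(cliOutput) as Partial<CorpusResponse>;
  if (!parsed.corpus || typeof parsed.corpus.corpusId !== "string") {
    throw new Error("Output does not contain a corpus.corpusId field");
  }
  return parsed.corpus.corpusId;
}

const createOutput = `{
  "corpus": {
    "corpusId": "66cd8b74-155f-45c8-83ed-28814ae7be89",
    "displayName": "My test corpus",
    "created": "2023-09-15T23:55:06.863527Z",
    "modified": "2023-09-15T23:55:06.891968Z",
    "stats": { "status": "CORPUS_STATUS_EMPTY" }
  }
}`;
const newCorpusId = extractCorpusId(createOutput);
console.log(newCorpusId);
```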
You can use npx fixie corpus list to see the list of corpora you have created.
Rather than creating a corpus from scratch, you can also query the existing corpus with ID 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f from your own code. This corpus contains a bunch of web pages with exciting facts about foxes, so it makes for a good place to test out the query functionality!
Adding Sources to the Corpus
The next step is to add one or more Sources to the Corpus. Again, you can do this via the Fixie web UI, or using the CLI:
$ npx fixie corpus sources add 66cd8b74-155f-45c8-83ed-28814ae7be89 https://en.wikipedia.org/wiki/Fox
Here, the string 66cd8b74... is the corpus ID returned when you first created the Corpus.
This adds a single web page Source to the Corpus. A Corpus can have multiple Sources, or a single Source can have multiple web pages -- this is just a simple example. See the Fixie API Documentation for more details on the underlying APIs.
If you wanted to include more documents -- say, all of the pages linked to by the above Wikipedia page, but from the same domain -- you could use:
$ npx fixie corpus sources add \
66cd8b74-155f-45c8-83ed-28814ae7be89 \
https://en.wikipedia.org/wiki/Fox \
--max-depth 1 \
--include-patterns 'https://en.wikipedia.org/**'
Here, we're indicating that the Source should include only web pages within one "hop" from the original page, and only with URLs that match the glob pattern provided on the command line.
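To build some intuition for how these two options interact, here is a minimal sketch of a crawl filter. It is an assumption-laden illustration, not Fixie's crawler: we assume "**" matches any characters including "/", that a single "*" matches within one path segment, and that depth counts link hops from the starting page. The globToRegExp and shouldInclude functions are hypothetical.

```typescript
// Hypothetical helper: translate a simple glob into an anchored regular expression.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*" && glob.charAt(i + 1) === "*") {
      pattern += ".*"; // "**" spans any characters, including "/"
      i++; // skip the second "*"
    } else if (ch === "*") {
      pattern += "[^/]*"; // "*" stays within one path segment
    } else {
      // Escape characters that are special in regular expressions.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp("^" + pattern + "$");
}

// A page is included if it is within maxDepth hops and matches any pattern.
function shouldInclude(
  url: string,
  depth: number,
  maxDepth: number,
  includePatterns: string[]
): boolean {
  if (depth > maxDepth) return false;
  return includePatterns.some((p) => globToRegExp(p).test(url));
}

const patterns = ["https://en.wikipedia.org/**"];
console.log(shouldInclude("https://en.wikipedia.org/wiki/Red_fox", 1, 1, patterns));
console.log(shouldInclude("https://www.example.com/foxes", 1, 1, patterns));
console.log(shouldInclude("https://en.wikipedia.org/wiki/Canidae", 2, 1, patterns));
```

Under these assumptions, the first URL is included, the second is excluded because it does not match the pattern, and the third is excluded because it is two hops away.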
You could, in principle, create a Source which crawls, say, the entire web -- but Fixie imposes a default cap on the maximum number of documents that can be included in a single Source, currently set to 200.
Checking the status of the Corpus
Note that building a Corpus from a Source can take a while, depending on the number of documents, web pages, etc. defined by the Source. Processing a single web page or document might take only a couple of minutes, but a large web crawl could take many hours.
You can check on the progress of your Corpus build by running npx fixie corpus get <corpus-id>, like so:
$ npx fixie corpus get 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f
{
"corpus": {
"corpusId": "44094d5a-f817-4c2e-a2a4-8f8a0c936d0f",
"displayName": "FoxesWorld.com",
"created": "2023-09-15T20:15:45.549089Z",
"modified": "2023-09-15T20:15:45.581481Z",
"stats": {
"status": "CORPUS_STATUS_READY",
"lastUpdated": "2023-09-15T21:11:53.788270Z",
"numChunks": "762"
}
}
}
This shows that the corpus has status CORPUS_STATUS_READY and has 762 chunks ready to be queried!
You can check out the Fixie API Documentation at docs.fixie.ai, which describes each of the fields in the API responses in detail.
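If you want to check readiness from a script, a small helper like the one below can interpret the stats object. It is only a sketch based on the sample output above: the CorpusStats interface and summarizeStats function are hypothetical names, and the only status values we rely on are the two that appear in this document (CORPUS_STATUS_EMPTY and CORPUS_STATUS_READY). Note that numChunks arrives as a JSON string, so it needs to be converted to a number.

```typescript
// Shape inferred from the "stats" object in the sample output above.
interface CorpusStats {
  status: string;
  lastUpdated?: string;
  numChunks?: string; // a string in the JSON response, e.g. "762"
}

// Hypothetical helper: report whether the corpus is ready and how many chunks it has.
function summarizeStats(stats: CorpusStats): { ready: boolean; numChunks: number } {
  return {
    ready: stats.status === "CORPUS_STATUS_READY",
    // An empty corpus has no numChunks field, so default to 0.
    numChunks: stats.numChunks ? Number.parseInt(stats.numChunks, 10) : 0,
  };
}

const readySummary = summarizeStats({
  status: "CORPUS_STATUS_READY",
  lastUpdated: "2023-09-15T21:11:53.788270Z",
  numChunks: "762",
});
const emptySummary = summarizeStats({ status: "CORPUS_STATUS_EMPTY" });
console.log(readySummary, emptySummary);
```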
Querying the corpus
You can run a manual corpus query on the Fixie Console by navigating to the page for the Corpus you want to test and selecting the Query Test tab. Alternatively, you can use the npx fixie corpus query command:
$ npx fixie corpus query 44094d5a-f817-4c2e-a2a4-8f8a0c936d0f "What is a fennec fox?"
{
"results": [
{
"chunkContent": "Fennec Fox - Fox Facts and Information ...",
"score": 0.897228,
"sourceId": "29c0999a-821b-4888-b838-49d7c028a667",
"documentId": "6aa96dd7-5bdb-468b-b7f6-04702246ebd5"
}
]
}
The query operation returns the set of chunks that are most relevant to the query, according to the vector database. The chunkContent field contains the contents of the chunk, score is the similarity measure, and the sourceId and documentId fields identify the source and document that the chunk came from.
Typically, the chunks returned from a query will be used in a subsequent call to an LLM to generate a final response to a question.
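Before handing chunks to an LLM, you may want to drop weak matches and cap how many you pass along. The sketch below shows one way to do that over a response shaped like the sample output above. The topChunks function and the minScore threshold are assumptions made for this example, and the second result in the sample data is invented to show the filter in action.

```typescript
// Shapes inferred from the sample query output above.
interface QueryResult {
  chunkContent: string;
  score: number;
  sourceId: string;
  documentId: string;
}
interface QueryResponse {
  results: QueryResult[];
}

// Hypothetical helper: keep results scoring at least minScore,
// sorted from most to least relevant, limited to `limit` entries.
function topChunks(response: QueryResponse, minScore: number, limit: number): QueryResult[] {
  return response.results
    .filter((r) => r.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const sampleResponse: QueryResponse = {
  results: [
    {
      // Invented low-scoring result, for illustration only.
      chunkContent: "Foxes are found throughout areas of North America ...",
      score: 0.41,
      sourceId: "hypothetical-source-id",
      documentId: "hypothetical-document-id",
    },
    {
      chunkContent: "Fennec Fox - Fox Facts and Information ...",
      score: 0.897228,
      sourceId: "29c0999a-821b-4888-b838-49d7c028a667",
      documentId: "6aa96dd7-5bdb-468b-b7f6-04702246ebd5",
    },
  ],
};

const selected = topChunks(sampleResponse, 0.5, 4);
console.log(selected.map((r) => r.chunkContent));
```

The right threshold depends on your embedding model and data, so treat 0.5 here as a placeholder rather than a recommendation.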
Using Documents in your Sidekick
Okay! Now that we have the fundamentals out of the way, we can show you how to add the ability to query a corpus of documents from your Sidekick. Fortunately, it's very easy, once you have the corpus built.
Step 1: Create a Sidekick
You've already done this, we hope, as part of the Quickstart.
Step 2: Create a FixieCorpus
You can query a corpus directly from an AI.JSX app using the FixieCorpus class, and use a <DocsQA> element in AI.JSX to answer a question from the results of the corpus query. Here is a simple example of a function that returns an AI.JSX component that queries a given corpus:
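import { DocsQA, FixieCorpus } from 'ai-jsx/batteries/docs';

// Declared async so that we can await the corpus query below.
async function DoCorpusQuery() {
  const corpus = new FixieCorpus('44094d5a-f817-4c2e-a2a4-8f8a0c936d0f');
  const question = 'What is a Fennec fox?';
  // Optional: query the corpus directly to inspect the raw chunks.
  const results = await corpus.search(question, { limit: 4 });
  // <DocsQA> uses the corpus to answer the question with the LLM.
  return (
    <DocsQA question={question} corpus={corpus} chunkLimit={5} />
  );
}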
The fixie-sidekick-template example uses a slightly different approach, in which the FixieCorpus.createTool() method is used to provide a Tool to the AI.JSX <Sidekick> component to query the corpus. This is functionally the same as the use of <DocsQA> above.