Interface: CorpusLoadRequest

batteries/docs.CorpusLoadRequest

A request to load documents for a corpus. During loading, the corpus will issue a series of such requests to the loader in order to populate itself.

In addition to returning documents, each response may expand the corpus space in one or both of two dimensions:

  1. Responses may include new partitions to be loaded. Partitions are non-overlapping subsets of a corpus which may be loaded in parallel. Any partition in a response that was already included in an earlier response is ignored.
  2. When a response includes a page of documents, that page may indicate that another page is available in the same partition. Pages are always loaded serially in order. The partition is completed when a response has a page with no next page token.

Loading always begins with a request for the default (unnamed) partition with no page token. Subsequent requests depend on prior responses and always include at least one of those two fields (a partition or a page token).
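
The request sequence described above can be sketched as a small driver loop. This is a minimal illustration under stated assumptions, not the library's implementation: only the `partition` and `pageToken` request fields come from this page, while the `LoadResponse` shape, the `SketchLoader` type, and the `loadAll` function are hypothetical names introduced here. For simplicity the sketch works through partitions one at a time, although partitions may be loaded in parallel.

```typescript
// Local stand-in for the request documented on this page.
interface LoadRequest {
  readonly partition?: string;
  readonly pageToken?: string;
}

// Hypothetical response shape, assumed for illustration only.
interface LoadResponse<Doc> {
  partitions?: { name: string; firstPageToken?: string }[];
  page?: { documents: Doc[]; nextPageToken?: string };
}

// Hypothetical loader signature: one request in, one response out.
type SketchLoader<Doc> = (request: LoadRequest) => Promise<LoadResponse<Doc>>;

async function loadAll<Doc>(loader: SketchLoader<Doc>): Promise<Doc[]> {
  const documents: Doc[] = [];
  const seen = new Set<string>();
  // Loading begins with the default (unnamed) partition and no page token.
  const queue: LoadRequest[] = [{}];

  while (queue.length > 0) {
    let request: LoadRequest | undefined = queue.shift();
    // Pages within one partition are requested serially, in order.
    while (request !== undefined) {
      const response = await loader(request);

      // Queue partitions that have not appeared in an earlier response.
      for (const p of response.partitions ?? []) {
        if (seen.has(p.name)) continue;
        seen.add(p.name);
        queue.push({ partition: p.name, pageToken: p.firstPageToken });
      }

      if (response.page) documents.push(...response.page.documents);

      // A page without a next page token completes the partition.
      const next = response.page?.nextPageToken;
      request =
        next === undefined
          ? undefined
          : { partition: request.partition, pageToken: next };
    }
  }
  return documents;
}
```

A loader passed to `loadAll` only needs to answer one request at a time; the loop decides which partition and page token to ask for next based on earlier responses.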

Examples:

  • Simple handful of documents: The response to the initial request contains only a page of documents. This could include a next page token for more documents in the single default partition if needed.

  • Web crawl: Each URL corresponds to a partition and the responses never include page tokens. The initial response only includes partitions, one for each root URL to crawl. Each subsequent request includes the partition (the URL) and the corresponding response contains a page with a single document: the resource at that URL. If the document links to other resources that should be included in the corpus, then the response also contains those URLs as new partitions. The process repeats for all partitions until there are no known incomplete partitions (or until crawl limits are reached).

  • Database: Consider a database with a parent table keyed by parent_id and an interleaved child table keyed by (parent_id, child_id) whose rows correspond to corpus documents. This loader will use tokens that encode a read timestamp (for consistency) and an offset to be used in combination with a static page size.

    Upon receiving the initial request, the loader chooses a commit timestamp to use for
    all reads and returns a partition for each parent_id along with a first page token
    indicating the chosen read timestamp and an offset of 0.

    For each partition, the loader then receives requests with the partition (a parent_id)
    and a page token (the read timestamp and offset). It responds with documents
    corresponding to the next page of child rows (at most the static page size) within the given parent. If more
    children exist, the response includes a next page token with the same read timestamp
    and an incremented offset. This repeats until there are no more children, at which
    point the response has no next page token and the partition is complete.

    Note: Including multiple parent_ids in each partition would also work and would be an
    effective way to limit parallelism if desired.
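
The database example above can be sketched as a loader function. This is an illustrative sketch only: the `LoadResponse` shape, the in-memory `childRows` table, `PAGE_SIZE`, and the JSON token encoding are all assumptions made for this example, and only the `partition` and `pageToken` request fields come from this page. A real loader would pass the read timestamp to its database reads; the in-memory stand-in merely carries it through the tokens.

```typescript
// Local stand-in for the request documented on this page.
interface LoadRequest {
  readonly partition?: string;
  readonly pageToken?: string;
}

// Hypothetical response shape, assumed for illustration only.
interface LoadResponse<Doc> {
  partitions?: { name: string; firstPageToken?: string }[];
  page?: { documents: Doc[]; nextPageToken?: string };
}

// Hypothetical in-memory stand-in for the child table, grouped by parent_id.
const childRows: Record<string, string[]> = {
  p1: ["p1-c1", "p1-c2", "p1-c3"],
  p2: ["p2-c1"],
};

// Static page size used together with the offset in each token.
const PAGE_SIZE = 2;

// Assumed token contents: a read timestamp plus a row offset.
interface Token {
  readTimestamp: number;
  offset: number;
}
const encodeToken = (t: Token): string => JSON.stringify(t);
const decodeToken = (s: string): Token => JSON.parse(s) as Token;

async function databaseLoader(request: LoadRequest): Promise<LoadResponse<string>> {
  if (!request.partition && !request.pageToken) {
    // Initial request: choose one timestamp for all reads and return a
    // partition per parent_id, each starting at offset 0.
    const firstPageToken = encodeToken({ readTimestamp: Date.now(), offset: 0 });
    return {
      partitions: Object.keys(childRows).map((name) => ({ name, firstPageToken })),
    };
  }
  if (!request.partition || !request.pageToken) {
    throw new Error("this sketch expects both a partition and a page token");
  }

  const { readTimestamp, offset } = decodeToken(request.pageToken);
  const rows = childRows[request.partition] ?? [];
  const documents = rows.slice(offset, offset + PAGE_SIZE);
  const nextOffset = offset + PAGE_SIZE;

  return {
    page: {
      documents,
      // Omit the next page token once the parent's children are exhausted.
      nextPageToken:
        nextOffset < rows.length
          ? encodeToken({ readTimestamp, offset: nextOffset })
          : undefined,
    },
  };
}
```

Because every token repeats the timestamp chosen on the initial request, all pages of all partitions can be read against the same snapshot even when partitions are loaded in parallel.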

Properties

pageToken

Optional Readonly pageToken: string

A token for paginating results within a corpus partition. If present, this will be echoed from a previous response.

Defined in

packages/ai-jsx/src/batteries/docs.tsx:169


partition

Optional Readonly partition: string

The partition of the corpus that should be read. This will be empty for the initial request, indicating the default partition. For subsequent requests, it will typically be the name of a partition returned by a previous request, though it could be empty if the default partition contains multiple pages.

Defined in

packages/ai-jsx/src/batteries/docs.tsx:163