Text Preprocessing and Cleaning
Last updated
After uploading content, users can choose from different tools for chunking, segmenting, and indexing the data.
Vord provides an automatic tool for chunking data, but users can also customize the chunking settings to suit their content.
Indexing is necessary for accurate data retrieval. Vord offers two indexing modes, each with its own retrieval methods:
High Quality
Economical
In High Quality mode, the system first uses a configurable embedding model (which can be switched) to convert chunk text into numerical vectors. This enables efficient compression and persistent storage of large-scale textual data, while also improving the accuracy of the LLM's responses to user queries.
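The indexing step can be sketched as follows. This is a minimal illustration, not Vord's actual implementation: the `embed` function below is a toy deterministic stand-in for a real embedding model, and the chunk texts are invented examples.

```python
import hashlib
import math

def embed(text, dim=16):
    """Toy embedding: hash character trigrams into a fixed-size vector,
    then L2-normalize. A real deployment would call the configured
    embedding model here instead."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Each chunk is embedded once; the resulting vectors are compact and
# can be persisted for later retrieval.
chunks = ["Refunds are processed in 5 days.", "Shipping is free over $50."]
index = [(chunk, embed(chunk)) for chunk in chunks]
```

Because the vectors are unit-normalized, the dot product of two vectors directly gives their cosine similarity at query time.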
This mode allows users to choose from three retrieval methods:
Vector Search: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.
Full-Text Search: Indexes all terms in the document, allowing users to query any term and retrieve the text fragments that contain it.
Hybrid Search: Performs both full-text search and vector search simultaneously, then applies a reranking step to select, from both result sets, the results that best match the user's query.
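The three retrieval methods can be sketched side by side. This is an illustrative toy, not Vord's engine: the `embed` function is a hash-based stand-in for a real embedding model, the keyword boost of 0.5 in the hybrid reranker is an arbitrary assumption, and the chunk texts are invented.

```python
import hashlib
import math
from collections import defaultdict

def embed(text, dim=16):
    """Toy trigram-hash embedding standing in for a real model."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        h = int(hashlib.md5(padded[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

CHUNKS = [
    "Refunds are processed within 5 business days.",
    "Standard shipping is free on orders over $50.",
    "Contact support to start a refund request.",
]
VECTORS = [embed(c) for c in CHUNKS]
INVERTED = defaultdict(set)                     # term -> chunk ids
for i, chunk in enumerate(CHUNKS):
    for term in chunk.lower().split():
        INVERTED[term.strip(".,$")].add(i)

def vector_search(query, top_k=2):
    """Rank chunks by cosine similarity to the query vector."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), i)
              for i, v in enumerate(VECTORS)]
    return [i for _, i in sorted(scored, reverse=True)[:top_k]]

def full_text_search(query):
    """Return ids of chunks containing any query term."""
    ids = set()
    for term in query.lower().split():
        ids |= INVERTED.get(term.strip(".,$"), set())
    return sorted(ids)

def hybrid_search(query, top_k=2):
    """Run both searches, then rerank the union of candidates:
    keyword hits get a fixed boost added to their cosine score."""
    q = embed(query)
    keyword_hits = set(full_text_search(query))
    candidates = keyword_hits | set(vector_search(query, top_k=len(CHUNKS)))
    scored = [(sum(a * b for a, b in zip(q, VECTORS[i]))
               + (0.5 if i in keyword_hits else 0.0), i)
              for i in candidates]
    return [i for _, i in sorted(scored, reverse=True)[:top_k]]
```

Note how the full-text path matches only exact terms ("refund" does not match "Refunds"), while the vector path can still surface semantically related chunks; the hybrid reranker combines both signals.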
Economical mode employs an offline vector engine and keyword indexing, which reduces accuracy but eliminates additional token consumption and associated costs. The indexing method is limited to inverted indexing.
The Top K parameter filters out the text chunks most similar to the user's question: a higher value results in more text segments being retrieved. The default value is 3. The system also dynamically adjusts the Top K value according to the max_tokens of the selected model, so that the retrieved segments fit within its context window.
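A dynamic adjustment like the one described could look like this. The function name, the token-budget formula, and all numbers below are illustrative assumptions, not Vord's actual logic:

```python
def effective_top_k(requested_k, max_tokens, avg_chunk_tokens=500, reserve=1000):
    """Hypothetical Top K adjustment: cap the number of retrieved chunks
    so they fit in the model's context window, keeping `reserve` tokens
    free for the prompt and the model's answer."""
    budget = max(max_tokens - reserve, 0)
    fit = budget // avg_chunk_tokens      # how many chunks the window can hold
    return max(1, min(requested_k, fit))  # always retrieve at least one chunk

effective_top_k(3, max_tokens=4096)   # -> 3 (all requested chunks fit)
effective_top_k(3, max_tokens=2048)   # -> 2 (capped by the token budget)
```

The point of the cap is that retrieving more chunks than the context window can hold would only truncate the prompt rather than improve the answer.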