After content is uploaded to the knowledge base, it undergoes chunking and data cleaning. This stage can be understood as content preprocessing and structuring.
What is text chunking and cleaning?
Chunking: Because LLMs have a limited context window, the full text usually has to be split into segments, and only the segments most relevant to the user's question are recalled, a pattern known as segment TopK recall. Appropriately sized segments also help match the most relevant content when the user's question is semantically compared against text segments, reducing information noise.
Cleaning: To ensure the quality of text recall, data usually needs to be cleaned before being passed to the model. For example, unwanted characters or blank lines may degrade response quality. To help with this, Vord provides various cleaning methods that tidy content before it is sent to downstream applications; see the ETL section below for more details.
Two strategies are supported:
Automatic mode
Custom mode
Automatic
The Automatic mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Vord automatically segments and cleans content files, streamlining the document preparation process.
Custom
Custom mode is tailored for advanced users with specific text processing requirements. This mode allows manual configuration of chunking rules and cleaning strategies based on different document formats and scenario demands.
Chunking Rules:
Delimiter: Specify a delimiter for text segmentation. For example, \n (the newline character in regex) chunks the text at each line break.
Maximum chunk length: Set the maximum number of characters per chunk. Chunks exceeding this limit are forcibly divided. The maximum length for a chunk is 4000 tokens.
Chunk overlap: Define the overlap between adjacent chunks. Overlap improves information retention and analysis accuracy, which in turn improves recall. The recommended setting is 10-25% of the chunk length in tokens.
Text Preprocessing Rules: These rules help filter out insignificant content from the knowledge base. A combined sketch of chunking and cleaning follows this list.
Replace consecutive spaces, newlines, and tabs.
Delete all URLs and email addresses.
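To make these rules concrete, here is a minimal Python sketch of delimiter-based chunking with overlap plus the two cleaning rules. The function names and the character-based length measure are illustrative simplifications; Vord measures chunk length in tokens, and its actual implementation may differ.

```python
import re

def clean_text(text: str) -> str:
    """Apply the two preprocessing rules described above."""
    # Rule 1: collapse consecutive spaces, newlines, and tabs.
    text = re.sub(r"[ \t\r\n]+", " ", text)
    # Rule 2: delete URLs and email addresses.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    return text.strip()

def chunk_text(text: str, delimiter: str = "\n",
               max_length: int = 500, overlap: int = 50) -> list[str]:
    """Split on the delimiter, then pack pieces into chunks of at most
    max_length characters, carrying an overlap tail between chunks."""
    pieces = [p.strip() for p in text.split(delimiter) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + 1 + len(piece) > max_length:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves boundary context
        current = f"{current} {piece}".strip()
        # A piece longer than max_length is forcibly divided.
        while len(current) > max_length:
            chunks.append(current[:max_length])
            current = current[max_length - overlap:]
    if current:
        chunks.append(current)
    return [clean_text(c) for c in chunks]
```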
Indexing Mode
You need to choose an indexing method for the text, which determines how data is matched at query time. The indexing strategy is closely tied to the retrieval method, so choose retrieval settings suited to your scenario.
High-Quality Mode
Economical Mode
Q&A Mode
In High-Quality mode, the system first uses a configurable Embedding model to convert chunk text into numerical vectors. This enables efficient compression and persistent storage of large-scale text while improving the accuracy of LLM-user interactions.
The High-Quality indexing method offers three retrieval settings: vector retrieval, full-text retrieval, and hybrid retrieval. For more details on retrieval settings, please check "Retrieval Settings".
This mode employs offline vector engines and keyword indexing, which reduces accuracy but eliminates additional token consumption and its associated cost. The indexing method is limited to an inverted index. For detailed specifications, please refer to the section below.
When documents are uploaded to the knowledge base, the system segments the text and generates Q&A pairs for each segment through summarization. Unlike the "Q to P" (Question to Paragraph) strategy employed in the High-Quality and Economical modes, the Q&A mode adopts a "Q to Q" (Question to Question) approach.
This methodology is preferred because question texts typically exhibit complete grammatical structures in natural language. The Q to Q matching paradigm facilitates clearer chunk understanding and matching, while also accommodating scenarios involving high-frequency and highly similar queries.
Upon user inquiry, the system identifies the most semantically similar question and returns the corresponding text segment as the answer. This method offers enhanced precision as it directly matches against user queries, thereby more accurately retrieving the information users genuinely require.
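The matching flow can be sketched as follows. Real deployments compare embedding vectors; the bag-of-words cosine similarity below is only a self-contained stand-in, and the Q&A pairs are made-up examples.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Q&A pairs generated from chunks at indexing time (illustrative data).
qa_pairs = [
    ("How do I reset my password?",
     "Open Settings > Account and click 'Reset password'."),
    ("Which file formats can I upload?",
     "PDF, TXT, Markdown, and DOCX files are supported."),
]

def answer(query: str) -> str:
    # Q to Q: match the user's question against the stored questions,
    # then return the answer paired with the closest question.
    qv = Counter(query.lower().split())
    best_question, best_answer = max(
        qa_pairs, key=lambda qa: cosine(qv, Counter(qa[0].lower().split())))
    return best_answer

print(answer("how can I reset my password"))
```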
Retrieval Settings
In high-quality indexing mode, Vord offers three retrieval settings:
Vector Search
Full-Text Search
Hybrid Search
Vector Search
Definition: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.
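As a sketch of this definition, assuming the chunk vectors were produced by the Embedding model at indexing time (cosine similarity is one common proximity measure; the actual metric depends on the configured vector store):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vector: list[float],
                  chunk_vectors: dict[str, list[float]],
                  top_k: int = 3) -> list[tuple[str, float]]:
    """Rank stored chunk vectors by proximity to the query vector."""
    scored = [(chunk_id, cosine_similarity(query_vector, vec))
              for chunk_id, vec in chunk_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```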
Vector Search Settings:
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the "Rerank Model" option in the retrieval settings. The system then semantically reorders the retrieved document results, optimizing the ranking. Once a Rerank model is configured, the TopK and Score Threshold settings only take effect during the reranking step.
TopK: This parameter selects the text chunks most similar to the user's question. The system also dynamically adjusts the number of chunks according to the context window of the selected model. The default value is 3; a higher value recalls more text chunks.
Score Threshold: This parameter sets a similarity threshold for filtering text chunks: only chunks scoring above the threshold are recalled. It is off by default, meaning recalled chunks are not filtered by similarity score. When enabled, the default value is 0.5; a higher value tends to yield fewer recalled chunks.
The TopK and Score Threshold configurations take effect only during the Rerank phase, so to apply either setting you must add and enable a Rerank model. A minimal sketch of this interplay follows.
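In the sketch below, score_with_rerank_model stands in for a call to the configured Rerank model; the token-overlap scorer is only a placeholder to keep the example runnable, since the real call depends on the provider's API.

```python
def rerank(query: str, candidates: list[str], top_k: int = 3,
           score_threshold: float | None = None) -> list[tuple[str, float]]:
    """Reorder retrieved chunks, then apply TopK and the optional threshold."""
    def score_with_rerank_model(q: str, doc: str) -> float:
        # Placeholder: a real implementation sends the (query, document)
        # pair to the configured Rerank model and returns its score.
        q_tokens = set(q.lower().split())
        d_tokens = set(doc.lower().split())
        return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0

    scored = sorted(((doc, score_with_rerank_model(query, doc))
                     for doc in candidates),
                    key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        # Score Threshold: drop chunks at or below the configured score.
        scored = [pair for pair in scored if pair[1] > score_threshold]
    return scored[:top_k]  # TopK: keep only the best-ranked chunks
```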
Full-Text Search
Definition: All terms in the document are indexed, allowing users to query any term and retrieve the text fragments containing it.
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the "Rerank Model" option in the retrieval settings. The system then semantically reorders the retrieved document results, optimizing the ranking. Once a Rerank model is configured, the TopK and Score Threshold settings only take effect during the reranking step.
TopK: This parameter selects the text chunks most similar to the user's question. The system also dynamically adjusts the number of chunks according to the context window of the selected model. The default value is 3; a higher value recalls more text chunks.
Score Threshold: This parameter sets a similarity threshold for filtering text chunks: only chunks scoring above the threshold are recalled. It is off by default, meaning recalled chunks are not filtered by similarity score. When enabled, the default value is 0.5; a higher value tends to yield fewer recalled chunks.
The TopK and Score Threshold configurations take effect only during the Rerank phase, so to apply either setting you must add and enable a Rerank model.
Hybrid Search
Definition: This mode performs full-text search and vector search simultaneously and adds a reordering step that picks the results best matching the user's query from both result sets. Users can either configure weight settings, which requires no Rerank model API, or use a Rerank model for retrieval.
Weight Settings: This feature lets users assign custom weights to semantic priority and keyword priority. Here, keyword search means full-text search within the knowledge base, and semantic search means vector search within it.
Semantic Value of 1
This activates semantic search mode only. Using embedding models, the search can return relevant content by computing vector distances even when none of the query's exact terms appear in the knowledge base. For multilingual content, semantic search can also capture meaning across languages, yielding more accurate cross-language results.
Keyword Value of 1
This activates keyword search mode only. It performs full matching of the input text against the knowledge base, which suits scenarios where the user knows the exact information or terminology. This approach consumes fewer computational resources and works well for quick searches over a large document knowledge base.
Custom Keyword and Semantic Weights
Beyond semantic-only and keyword-only search, flexible custom weight settings are provided. You can adjust the weights of the two methods continuously to find the ratio that best fits your business scenario, as the sketch below illustrates.
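One common way to realize such weighting is a linear fusion of the normalized scores from the two searches. Whether Vord uses exactly this formula is not documented here, so treat the following as illustrative:

```python
def hybrid_score(semantic_score: float, keyword_score: float,
                 semantic_weight: float = 0.7) -> float:
    """Fuse the two retrieval scores; the two weights sum to 1.
    semantic_weight=1.0 reduces to pure semantic (vector) search,
    semantic_weight=0.0 to pure keyword (full-text) search."""
    assert 0.0 <= semantic_weight <= 1.0
    return (semantic_weight * semantic_score
            + (1 - semantic_weight) * keyword_score)

# Example: a chunk scoring 0.9 semantically and 0.4 on keywords.
print(hybrid_score(0.9, 0.4, semantic_weight=0.7))  # 0.75
```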
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the "Rerank Model" option in the retrieval settings. The system then semantically reorders the document results retrieved by the hybrid search, optimizing the ranking (the rerank sketch shown under Vector Search applies here as well). Once a Rerank model is configured, the TopK and Score Threshold settings only take effect during the reranking step.
The "Weight Settings" and "Rerank Model" settings support the following options:
TopK: This parameter selects the text chunks most similar to the user's question. The system also dynamically adjusts the number of chunks according to the context window of the selected model. The default value is 3; a higher value recalls more text chunks.
Score Threshold: This parameter sets a similarity threshold for filtering text chunks: only chunks scoring above the threshold are recalled. It is off by default, meaning recalled chunks are not filtered by similarity score. When enabled, the default value is 0.5; a higher value tends to yield fewer recalled chunks.
In the Economical indexing mode, Vord offers a single retrieval setting:
Inverted Index:
An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle is to map each keyword to the list of documents containing it, which makes lookups far faster than scanning every document. For a detailed explanation of the underlying mechanism, please refer to "Inverted Index".
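A toy version of the structure, with made-up example chunks:

```python
from collections import defaultdict

def build_inverted_index(chunks: dict[int, str]) -> dict[str, set[int]]:
    """Map each keyword to the set of chunk IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for chunk_id, text in chunks.items():
        for term in text.lower().split():
            index[term].add(chunk_id)
    return index

chunks = {
    1: "reset your password under account settings",
    2: "upload PDF or Markdown files to the knowledge base",
}
index = build_inverted_index(chunks)
print(index["password"])  # {1} - found without scanning every chunk
```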
TopK:
This parameter selects the text chunks most similar to the user's question. The system also dynamically adjusts the number of chunks according to the context window of the selected model. The default value is 3; a higher value recalls more text chunks.
Reference
Optional ETL Configuration
In production-level RAG applications, better data recall requires preprocessing and cleaning multi-source data, i.e., ETL (extract, transform, load). To strengthen the preprocessing of unstructured and semi-structured data, Vord supports two optional ETL solutions: Vord ETL and Unstructured ETL.
Unstructured can efficiently extract and transform your data into clean data for subsequent steps.
ETL solution choices in different versions of Vord:
The SaaS version defaults to using Unstructured ETL and cannot be changed;
The community version defaults to using Vord ETL but can enable Unstructured ETL through environment variables;
The two solutions differ in the file formats they support for parsing and in extraction quality. For more information on Unstructured ETL's data processing methods, please refer to the official documentation.
Embedding Model
Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.
Embedding models are specialized large language models that excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.
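The interface of an embedding model can be pictured as: text in, fixed-length float vector out. The hashing trick below is only a toy stand-in to show that interface; real embedding models learn their vectors so that semantically similar texts land near each other.

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Toy 'embedding' via feature hashing: each token increments one of
    `dim` buckets, and the vector is L2-normalized. Real models produce
    learned, semantically meaningful vectors, typically with hundreds
    or thousands of dimensions."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

print(toy_embed("Embedding maps discrete words to continuous vectors"))
```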