There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG focused on information retrieval from large databases of texts, like Wikipedia, rather than from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and we demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method for disambiguating queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe that this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone towards effective memory-augmented conversational agents that can be used in a wide variety of AI applications.
Conversational agents, such as chatbots, personal assistants, and language-interfacing operating systems, are currently seeing rapid development. One specific area of interest is conversational agents that use retrieval-augmented generation (RAG), which imbues LLMs with long-term memory. However, the popular Question-Answering (QA) benchmarks that typically test RAG systems focus primarily on information retrieval (IR) from a static database of texts, such as Wikipedia. The increasing importance of conversational agents raises the question of how to address the unique challenges that RAG models face in conversational contexts but not in offline, database retrieval contexts.
There seem to be two crucial challenges conversational agents face that are not tested in most standard database retrieval benchmarks:

1) Time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday).

2) Ambiguous queries, which require surrounding conversational context to understand.
We found that there is currently a lack of benchmarks that directly test models on both of these challenges simultaneously in conversational contexts. Further, current conversational LLMs that use standard vector database retrieval methods do not directly address these challenges.
To address these challenges, we at Zyphra created a new open-source benchmark of time-based and ambiguous questions over long-form conversations, along with a novel retrieval method that combines chain-of-table search, standard vector-database retrieval, and a prompting method for disambiguating queries.
We created an open-source dataset by augmenting an existing long-form chat dataset known as LoCoMo [1], using its open-source chat logs to create three types of questions based on the chat log.
Vector database retrieval returns text chunks based on their semantic similarity to the query text. Vector search is the standard for RAG chatbots, but it alone will not work for the sorts of questions in our dataset, which require retrieval based on the meta-data of the text. To deal with this issue, we combine vector database search with a tabular search method known as chain-of-table.
First, we store the chat text in a table, where each row represents one response. Columns hold meta-data (such as the date and session of the response) and an index to the response's associated semantic vector. Our retrieval algorithm then chains table operations with semantic search to retrieve text, as sketched below.
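To make this concrete, here is a minimal sketch of such a hybrid retrieval step, assuming a pandas table and a generic embedding function; the schema, helper names, and the pre-computed table filter are illustrative assumptions rather than our exact implementation. In a full chain-of-table step, an LLM would translate the time/event part of the query into the table operation.

```python
import numpy as np
import pandas as pd

# Hypothetical chat-log table: one row per response, with time meta-data
# and an index into a matrix of semantic embeddings.
chat_table = pd.DataFrame({
    "session": [1, 1, 2, 3],
    "date": pd.to_datetime(["2024-05-06", "2024-05-06", "2024-05-07", "2024-05-07"]),
    "speaker": ["Alice", "Bob", "Alice", "Bob"],
    "text": ["Let's meet Tuesday.", "Sounds good.", "The trip was great!", "Send photos?"],
    "vec_idx": [0, 1, 2, 3],
})
embeddings = np.random.rand(4, 384)  # stand-in for real sentence embeddings


def embed(query: str) -> np.ndarray:
    """Stand-in embedder; a real system would call a sentence encoder."""
    return np.random.rand(384)


def retrieve(query: str, table_filter: str, top_k: int = 2) -> pd.DataFrame:
    # Step 1 (chain-of-table): narrow the table with a meta-data filter,
    # e.g. one an LLM produced from "the first conversation on May 6th".
    candidates = chat_table.query(table_filter)
    # Step 2 (semantic): rank the surviving rows by cosine similarity.
    q = embed(query)
    vecs = embeddings[candidates["vec_idx"].to_numpy()]
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return candidates.assign(score=scores).nlargest(top_k, "score")


print(retrieve("What did we plan?", "date == '2024-05-06' and session == 1"))
```

The key design point is that the meta-data filter runs before semantic ranking, so a query like "the third conversation on Tuesday" is resolved by table operations that pure embedding similarity cannot express.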
To deal with ambiguous queries, we also adapt state-of-the-art query rewriting methods to our algorithm, with promising results.
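As a simple illustration of the query-rewriting idea, one can prompt an LLM to restate the user's question so that it stands alone before retrieval; the prompt and function below are hypothetical examples, not our exact method.

```python
REWRITE_PROMPT = """Given the recent conversation and the user's latest question,
rewrite the question so that it is fully self-contained and unambiguous.

Conversation:
{history}

Question: {question}
Rewritten question:"""


def rewrite_query(llm, history: str, question: str) -> str:
    # `llm` is any text-completion callable; retrieval then runs on the
    # rewritten, self-contained query instead of the raw ambiguous one.
    return llm(REWRITE_PROMPT.format(history=history, question=question))
```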
We find our CoTable+Semantic retrieval method significantly outperforms standard semantic retrieval methods.
We see that for both GPT-3.5 and a Mistral-7B model, our method achieves almost perfect recall and a very high F2 score on the items it retrieves from memory, showing that our method retrieves the correct text from memory (high recall) without retrieving many extra irrelevant items (high F2). This is unlike existing semantic retrieval methods, which perform poorly because they are unable to handle time-related meta-data correctly.
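For reference, F2 is the F-beta score with beta = 2, which weights recall more heavily than precision: F2 = (5 · precision · recall) / (4 · precision + recall). A retriever therefore scores well on F2 only if it finds nearly all of the relevant turns while also keeping precision reasonably high.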
For ambiguous queries, we find that augmenting this method with state-of-the-art query rewriting allows the model to perform almost as well as it does on the unambiguous questions (see Table 2).
An important capability for conversational agents is understanding when past conversational events occurred and handling ambiguous queries, both of which are commonplace in conversational contexts. We believe our dataset can serve as a basis for testing how well various IR methods understand temporal and conversational context, and our empirical results suggest that combining chain-of-table and semantic retrieval methods is a useful starting point.
[1] Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.
We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11). Please note that all the figures are in log-log scale.

We see a significant drop in the number of clusters starting from a size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license-agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see Appendix).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
Below is an example document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is Safe
Safe score: 1
The higher the number, the more dangerous the website. Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents from DCLM with different quality scores, all coming from the same duplicates cluster. The quality score varies from ~0.2 to ~0.04.