There has recently been growing interest in conversational agents with long-term memory, which has led to the rapid development of language models that use retrieval-augmented generation (RAG). Until recently, most work on RAG focused on information retrieval from large databases of texts, like Wikipedia, rather than from long-form conversations. In this paper, we argue that effective retrieval from long-form conversational data faces two unique problems compared to static database retrieval: 1) time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday), and 2) ambiguous queries that require surrounding conversational context to understand. To better develop RAG-based agents that can deal with these challenges, we generate a new dataset of ambiguous and time-based questions that builds upon a recent dataset of long-form, simulated conversations, and we demonstrate that standard RAG-based approaches handle such questions poorly. We then develop a novel retrieval model that combines chain-of-table search methods, standard vector-database retrieval, and a prompting method for disambiguating queries, and demonstrate that this approach substantially improves over current methods at solving these tasks. We believe that this new dataset and more advanced RAG agent can act as a key benchmark and stepping stone towards effective memory-augmented conversational agents that can be used in a wide variety of AI applications.
Conversational agents, such as chatbots, personal assistants, and language-interfacing operating systems, are currently seeing rapid development. One specific area of interest is conversational agents that use retrieval-augmented generation (RAG), which imbues LLMs with long-term memory. However, the popular Question-Answering (QA) benchmarks that typically test RAG systems focus primarily on information retrieval (IR) from a static database of texts, such as Wikipedia. The increasing importance of conversational agents raises the question of how to address the unique challenges that RAG models face in conversational contexts but not in offline, database retrieval contexts.
There seem to be two crucial challenges conversational agents face that are not tested in most standard database retrieval benchmarks:

1) Time/event-based queries, which require the model to retrieve information about previous conversations based on time or the order of a conversational event (e.g., the third conversation on Tuesday).

2) Ambiguous queries, which require surrounding conversational context to understand.
We found that there is currently a lack of benchmarks that directly test models on both of these challenges simultaneously in conversational contexts. Further, current conversational LLMs that use standard vector database retrieval methods do not directly address these challenges.
To address these challenges, we at Zyphra created a new open-source benchmark of time-based and ambiguous questions over long-form conversations, along with a novel retrieval method that combines chain-of-table search, standard vector-database retrieval, and a prompting method for disambiguating queries.
We created an open-source dataset by augmenting an existing long-form chat dataset known as LoCoMo [1], using its open-source chat logs to create three types of questions based on the chat log.
Vector database retrieval returns text chunks based on their semantic similarity to the query text. Vector search is the standard for RAG chatbots, but it alone will not work for the sorts of questions in our dataset, which require retrieval based on the meta-data of the text. To deal with this issue, we combine vector database search with a tabular search method known as chain-of-table.
First, we store the chat text in a table, where each row represents one response. Columns hold meta-data (such as the date and session of the response) and an index to the response's associated semantic vector. Our retrieval algorithm then chains table operations with semantic search to retrieve text, as sketched below.
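To make this concrete, here is a minimal sketch of such a hybrid retrieval step, assuming a pandas table and a generic embedding function; the schema, helper names, and the pre-computed table filter are illustrative assumptions rather than our exact implementation. In a full chain-of-table step, an LLM would translate the time/event part of the query into the table operation.

```python
import numpy as np
import pandas as pd

# Hypothetical chat-log table: one row per response, with time meta-data
# and an index into a matrix of semantic embeddings.
chat_table = pd.DataFrame({
    "session": [1, 1, 2, 3],
    "date": pd.to_datetime(["2024-05-06", "2024-05-06", "2024-05-07", "2024-05-07"]),
    "speaker": ["Alice", "Bob", "Alice", "Bob"],
    "text": ["Let's meet Tuesday.", "Sounds good.", "The trip was great!", "Send photos?"],
    "vec_idx": [0, 1, 2, 3],
})
embeddings = np.random.rand(4, 384)  # stand-in for real sentence embeddings


def embed(query: str) -> np.ndarray:
    """Stand-in embedder; a real system would call a sentence encoder."""
    return np.random.rand(384)


def retrieve(query: str, table_filter: str, top_k: int = 2) -> pd.DataFrame:
    # Step 1 (chain-of-table): narrow the table with a meta-data filter,
    # e.g. one an LLM produced from "the first conversation on May 6th".
    candidates = chat_table.query(table_filter)
    # Step 2 (semantic): rank the surviving rows by cosine similarity.
    q = embed(query)
    vecs = embeddings[candidates["vec_idx"].to_numpy()]
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return candidates.assign(score=scores).nlargest(top_k, "score")


print(retrieve("What did we plan?", "date == '2024-05-06' and session == 1"))
```

The key design point is that the meta-data filter runs before semantic ranking, so a query like "the third conversation on Tuesday" is resolved by table operations that pure embedding similarity cannot express.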
To deal with ambiguous queries, we also adapt state-of-the-art query rewriting methods to our algorithm, with promising results.
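As a simple illustration of the query-rewriting idea, one can prompt an LLM to restate the user's question so that it stands alone before retrieval; the prompt and function below are hypothetical examples, not our exact method.

```python
REWRITE_PROMPT = """Given the recent conversation and the user's latest question,
rewrite the question so that it is fully self-contained and unambiguous.

Conversation:
{history}

Question: {question}
Rewritten question:"""


def rewrite_query(llm, history: str, question: str) -> str:
    # `llm` is any text-completion callable; retrieval then runs on the
    # rewritten, self-contained query instead of the raw ambiguous one.
    return llm(REWRITE_PROMPT.format(history=history, question=question))
```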
We find our CoTable+Semantic retrieval method significantly outperforms standard semantic retrieval methods.
We see that for both GPT-3.5 and a Mistral-7B model, our method achieves almost perfect recall and a very high F2 score on the items it retrieves from memory, showing that our method retrieves the correct text from memory (high recall) without retrieving many extra irrelevant items (high F2). This is unlike existing semantic retrieval methods, which perform poorly because they are unable to handle time-related meta-data correctly.
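For reference, F2 is the F-beta score with beta = 2, which weights recall more heavily than precision: F2 = (5 · precision · recall) / (4 · precision + recall). A retriever therefore scores well on F2 only if it finds nearly all of the relevant turns while also keeping precision reasonably high.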
For ambiguous queries, we find that augmenting this method with state-of-the-art query rewriting allows the model to perform almost as well as it does on the unambiguous questions (see Table 2).
An important capability for conversational agents is understanding when past conversational events occurred and handling ambiguous queries, both of which are commonplace in conversational contexts. We believe our dataset can serve as a basis for testing how well various IR methods understand temporal and conversational context, and our empirical results suggest that combining chain-of-table and semantic retrieval methods is a useful starting point.
[1] Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.
We present histograms depicting the distribution of cluster sizes in all the datasets (see Fig. 7-11). Please note that all the figures are in log-log scale.

We see a significant drop in the number of clusters starting from a size of around 100. This drop is present in both DCLM and FineWeb-Edu2 (see Fig. 8 and 9 respectively), and is most likely explained by a combination of the deduplication strategy and quality filtering used when creating both datasets: DCLM deduplication was done individually within 10 shards, while FineWeb-Edu2 was deduplicated within every Common Crawl snapshot. We find that large clusters usually contain low-quality material (repeated advertisements, license-agreement templates, etc.), so it is not surprising that such documents were removed. Notably, DCLM still contained one cluster of close to 1 million documents, consisting of low-quality documents seemingly coming from advertisements (see Appendix).

We find that both Zyda-1 and Dolma-CC contain a small number of duplicates, which is expected, since both datasets were deduplicated globally by their authors. The remaining duplicates are likely false negatives from the initial deduplication procedure. Note that the distributions of duplicate-cluster sizes for these two datasets (Fig. 10 and 11) do not contain any sharp drops, but rather decrease hyper-exponentially with cluster size.
Below is an example document from the largest cluster (~1M documents) of duplicates in DCLM (quality score 0.482627):
Is safe? Is scam?
Is safe for your PC?
Is safe or is it scam?
Domain is Safe
Safe score: 1
The higher the number, the more dangerous the website. Any number higher than 1 means DANGER.
Positive votes:
Negative votes:
Vote Up Vote Down review
Have you had bad experience with Warn us, please!
Below are a few documents from DCLM with different quality scores, all coming from the same duplicates cluster. The quality score varies from ~0.2 to ~0.04.