Finding information in long documents with AI using vector databases and MapReduceChain from Langchain

AI works great when we want to extract information from documents. However, we quickly encounter problems. What if we need to find information in a large number of documents? We can’t pass them as part of the prompt one by one because it would take too long and cost too much. What if the documents are so long that they won’t fit in the prompt anyway?

The standard way of dealing with these issues is to split the documents into smaller chunks, calculate an embedding for each chunk, and store the embeddings in a vector database. Then, we can use the vector database to find the chunks most similar to the prompt. Finally, we pass those chunks to AI to extract the information we need.

Unfortunately, this approach has a few problems:

  • How do we split the document? We can’t split the document in a random location or create a new chunk every 400 characters. We would cut the sentences in half, and the AI would have a hard time understanding the text.
  • No matter what splitting strategy we choose, a single chunk may not be enough to understand the context. We must find the source document of the most relevant chunk.
  • The source document may not fit the prompt because the text may be too long. We may need to split the document again and ask AI to extract the information from each chunk separately. How do we do it without losing the context?
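
To make the first problem concrete, here is a minimal sketch of a fixed-width split. The function name, the sample text, and the tiny chunk size of 25 characters are my assumptions for illustration; they are not part of Langchain.

```python
# A fixed-width split: simple, but blind to sentence boundaries.
def split_every_n_chars(text: str, n: int) -> list[str]:
    return [text[i:i + n] for i in range(0, len(text), n)]


text = "Writing is thinking. Publishing is sharing. Editing is caring."
chunks = split_every_n_chars(text, 25)

# Print each chunk with quotes so the cut-off words are easy to spot.
for chunk in chunks:
    print(repr(chunk))
```

Every chunk boundary in this example falls in the middle of a word, so none of the chunks can be read on its own.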

In this article, I will show you how to solve these problems using Langchain. We will split the documents into paragraphs or sentences using RecursiveCharacterTextSplitter to preserve the context. We will use ParentDocumentRetriever to find the source document of the most relevant chunk. Finally, we will use MapReduceChain to pass a lengthy document to AI and extract information.

Required Dependencies

Before I start, I have to install the required dependencies:

  • langchain - an AI helper library
  • openai - the OpenAI API client
  • chromadb - a vector database
  • tiktoken - a BPE tokeniser
  • lark - the parsing library used by the self-querying feature of langchain
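
All five packages can be installed with pip. If you are not sure which ones are already available, a quick check like the one below may help. It is only a sketch: the helper name missing_packages is mine, and I assume that every package is imported under the same name it is installed with.

```python
from importlib.util import find_spec

# Assumption: the import name of each package equals its pip name.
REQUIRED = ["langchain", "openai", "chromadb", "tiktoken", "lark"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install first: pip install " + " ".join(missing))
```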

Storing Long Documents in a Vector Database

I have an array of essays written by one of my favorite authors, David Perell. Each element of the articles array is a tuple with the page title and the text. The first thing I have to do is to create a Langchain Document from each essay. A document may have metadata in addition to text. I will use the metadata to store the title and the author. We can use the metadata later to filter the results.

from langchain.schema import Document

docs = []

for page_title, article in articles:
  [article_title, article_author] = page_title.split(' - ')
  document = Document(
      page_content=article,
      metadata={'author': article_author, 'title': article_title}
  )
  docs.append(document)
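
The split in the loop above works only when the page title contains exactly one ' - ' separator. If some of your titles contain a dash themselves, a more defensive parser may be useful. The sketch below is hypothetical: the function name and the "unknown" default are mine, and I assume the author is always the last part of the page title.

```python
def parse_page_title(page_title: str, default_author: str = "unknown"):
    """Split 'Title - Author' on the LAST ' - ' so titles may contain dashes."""
    title, separator, author = page_title.rpartition(" - ")
    if not separator:
        # No separator found: treat the whole string as the title.
        return page_title.strip(), default_author
    return title.strip(), author.strip()


article_title, article_author = parse_page_title("Imitate, then Innovate - David Perell")
print(article_title, "|", article_author)
```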

Now, I need several things. First is the vector database to store embeddings. I will use the Chroma vector database in the in-memory mode. The second thing is a document store to keep the text of documents. For the sake of an example, an in-memory store is enough.

Finally, I need a splitter to split the documents into smaller chunks. I will use RecursiveCharacterTextSplitter to split the documents into paragraphs or sentences. The recursive splitter tries to keep all paragraphs (and then sentences, and then words) together as long as possible. Of course, if the chunk is still too long and there is no other option, the text splitter will cut the words in the middle, but that’s the last resort.

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(
    collection_name="articles",
    embedding_function=OpenAIEmbeddings(openai_api_key='sk-...')
)
store = InMemoryStore()
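
To show the idea behind recursive splitting, here is a simplified sketch in plain Python. It is not the Langchain implementation. The separator list is my assumption, and unlike the real splitter, the sketch drops the separator at chunk boundaries and doesn't support chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; fall back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: cut in the middle of a word.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    if separator not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long, so try a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

With a chunk size of 40, both paragraphs stay intact. With a smaller limit, the sketch falls back to sentences and then to words before it cuts anything in the middle.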

I have to put those things together using a ParentDocumentRetriever. The retriever stores the chunks in the vector database and the source document in the document store. In addition, the retriever adds metadata with the ID of the source document to each chunk. When we retrieve relevant documents, the ParentDocumentRetriever will find the chunk and use the metadata to return the entire document.

from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=text_splitter,
)
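
The mechanism is easier to see in a toy version without any AI. The class below is my own illustration, not Langchain code: a dictionary plays the role of the document store, a list of tuples plays the role of the vector database, and the caller supplies the matching chunks instead of running a similarity search.

```python
import uuid


class TinyParentStore:
    """Toy model of the parent/child bookkeeping (hypothetical, for illustration)."""

    def __init__(self, chunk_size=40):
        self.chunk_size = chunk_size
        self.docstore = {}  # parent id -> full text
        self.chunks = []    # (chunk text, metadata) pairs

    def add_document(self, text):
        doc_id = str(uuid.uuid4())
        self.docstore[doc_id] = text
        for start in range(0, len(text), self.chunk_size):
            chunk = text[start:start + self.chunk_size]
            # Every chunk remembers which document it came from.
            self.chunks.append((chunk, {"doc_id": doc_id}))
        return doc_id

    def parents_of(self, matching_chunks):
        """Map chunks back to their source documents, without duplicates."""
        ids = []
        for _chunk, metadata in matching_chunks:
            if metadata["doc_id"] not in ids:
                ids.append(metadata["doc_id"])
        return [self.docstore[doc_id] for doc_id in ids]


toy = TinyParentStore(chunk_size=40)
toy.add_document("x" * 100)
print(len(toy.chunks), "chunks,", len(toy.parents_of(toy.chunks)), "parent document")
```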

Now, I can use the add_documents function to store the documents in the vector database and the document store. The function accepts two parameters: the documents and a list of document ids. The ids parameter may be None. In this case, the retriever will generate the ids automatically. I recommend passing the ids if you already have an id of the document (for example, a database id). Otherwise, pass None.

retriever.add_documents(docs, ids=None)
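
If you have no database id but still want the same essay to get the same id every time you run the code, you could derive the id from the metadata. This is only a sketch of one option. The helper name stable_doc_id is hypothetical, and I assume the author and title together identify an essay.

```python
import uuid


def stable_doc_id(author: str, title: str) -> str:
    """Derive a deterministic id from metadata (assumes author + title is unique)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{author}/{title}"))


example_id = stable_doc_id("David Perell", "Why You Should Write")
print(example_id)

# With the documents from this article, the call would then look like this:
# ids = [stable_doc_id(d.metadata['author'], d.metadata['title']) for d in docs]
# retriever.add_documents(docs, ids=ids)
```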

Retrieving Documents By Keywords

ParentDocumentRetriever can find documents whose vectors are similar to a given text. Therefore, it works well when we search by keywords, not when trying to find documents answering a question.
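
What does "similar vectors" mean? One common measure is cosine similarity, shown in the sketch below with made-up two-dimensional vectors. Real embeddings have many more dimensions, and the vector database may use a different distance function, so treat this as an illustration of the ranking idea only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vector, indexed, k=2):
    """Return the names of the k entries closest to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _vector in ranked[:k]]


# Made-up vectors; the names are placeholders, not real chunks.
indexed = [("chunk a", [0.9, 0.1]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 0.0])]
print(most_similar([1.0, 0.0], indexed))
```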

If I wanted to find David Perell’s hints on writing, I could ask the retriever to get the documents similar to the prompt “writing guidelines”:

found_documents = retriever.get_relevant_documents("writing guidelines")

for document in found_documents:
  print(document.metadata['title'])

The code prints “The Ultimate Guide to Writing Online” and “Why You Should Write” — two excellent essays on writing.

The keyword-based search would work great in a use case similar to Google search. However, AI got us used to chatbot interactions. If I asked, “What are David Perell’s hints on writing?” the vector database would return too many documents. Of course, if I passed them through AI later, AI would skip the irrelevant ones, but it would be a waste of resources. I need a better approach.

Retrieving Documents By Questions

What if I could pass an entire question to the retriever, and it would manage to understand what I want? Lucky for us, the SelfQueryRetriever implements exactly this use case. The SelfQueryRetriever will use AI to convert the question into a query for the vector database. Then, the retriever will use the query to find the most relevant documents.

I will need an LLM, a description of the documents in the database, and a description of the document metadata:

from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo


metadata_field_info=[
    AttributeInfo(
        name="title",
        description="The title of the essay",
        type="string",
    ),
    AttributeInfo(
        name="author",
        description="The author of the essay",
        type="string",
    ),
]
document_content_description = "An entire essay about any topic"
llm = OpenAI(temperature=0, max_tokens=500, openai_api_key='sk-...')

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)

Let’s check if the metadata filtering works. I ask a question that’s impossible to answer with the data in the database:

found_documents = retriever.get_relevant_documents({"query": "What are Bartosz Mikulski's hints on writing?"})
len(found_documents)

In the output, we see the SelfQueryRetriever understood the question, parsed the query, and constructed a filter for the document metadata. Most importantly, it returned zero documents. As expected.

query='hints on writing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='Bartosz Mikulski') limit=None
0
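
The filter in the output above is an equality comparison on a metadata attribute. Conceptually, it behaves like the plain-Python sketch below. The chunk texts are placeholders, and the function is my illustration of the idea, not the code that the vector database runs.

```python
# Placeholder chunks; only the metadata matters in this illustration.
chunks = [
    {"text": "placeholder chunk 1",
     "metadata": {"author": "David Perell", "title": "Why You Should Write"}},
    {"text": "placeholder chunk 2",
     "metadata": {"author": "David Perell", "title": "The Ultimate Guide to Writing Online"}},
]


def apply_eq_filter(items, attribute, value):
    """Keep only the items whose metadata attribute equals the value."""
    return [item for item in items if item["metadata"].get(attribute) == value]


print(len(apply_eq_filter(chunks, "author", "Bartosz Mikulski")))
print(len(apply_eq_filter(chunks, "author", "David Perell")))
```

The similarity search runs only on the chunks that pass the filter, so a question about an author who isn't in the database returns nothing.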

Now, I can ask a question about David Perell’s hints on writing:

found_documents = retriever.get_relevant_documents({"query": "What are David Perell's hints on writing?"})
len(found_documents)

I get a relevant metadata filter, and the retriever finds four documents:

query='hints on writing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='author', value='David Perell') limit=None
4

To be precise, it finds four chunks. Unfortunately, the SelfQueryRetriever doesn’t return the full documents like ParentDocumentRetriever does. We can copy the relevant part of the code from the ParentDocumentRetriever implementation and get the same behavior:

def get_full_documents(docstore, sub_docs):
  ids = []
  for d in sub_docs:
      if d.metadata['doc_id'] not in ids:
          ids.append(d.metadata['doc_id'])
  docs = docstore.mget(ids)
  return [d for d in docs if d is not None]


found_full_documents = get_full_documents(store, found_documents)
for doc in found_full_documents:
  print(doc.metadata['title'])

This time, my code returns “Why You Should Write,” “The Ultimate Guide to Writing Online,” and “Imitate, then Innovate.”

A slightly different query (“hints on writing” vs. “writing guidelines”) returns different results. Therefore, in production, I would add a step where AI paraphrases the user’s question to generate several versions. Each version might yield different results, and I could combine them to get the best possible answer.
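
A minimal sketch of that step could look like the code below. Both callables are assumptions: paraphrase stands for an AI call that returns alternative wordings, and search stands for a retriever call that returns document ids. In the example, I replace them with stubs to show only the merging logic.

```python
def search_with_paraphrases(question, paraphrase, search):
    """Run the search for every wording and merge the results without duplicates."""
    seen, merged = set(), []
    for variant in [question, *paraphrase(question)]:
        for doc_id in search(variant):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Stubs standing in for the AI and the retriever (hypothetical data).
fake_index = {
    "hints on writing": ["doc-1", "doc-2"],
    "writing guidelines": ["doc-2", "doc-3"],
}
merged_ids = search_with_paraphrases(
    "hints on writing",
    paraphrase=lambda q: ["writing guidelines"],
    search=lambda q: fake_index.get(q, []),
)
print(merged_ids)
```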

Using AI to Find Information in Long Documents

Returning three long essays to the user is a viable option, but summarizing them and producing a paragraph-long answer would be better. Unfortunately, most of David Perell’s essays won’t fit in a single prompt.

I will need to split each document into chunks, find the quotes that answer the question in each chunk, combine them into a single document, and ask AI to summarize it. Fortunately, I don’t have to write the code because such behavior is implemented by MapReduceChain.
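
Before using the real chain, it may help to see the pattern in plain Python. The sketch below is not the MapReduceChain implementation. The map and reduce steps are stubs that stand in for AI calls, and all function names are mine.

```python
def map_reduce(document, question, split, map_step, reduce_step):
    chunks = split(document)
    # Map: process every chunk independently.
    partial_answers = [map_step(question, chunk) for chunk in chunks]
    # Reduce: combine the non-empty partial answers into one result.
    return reduce_step(question, [answer for answer in partial_answers if answer])


def split_paragraphs(document):
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def fake_map(question, chunk):
    # Stand-in for an AI call: keep the chunks that mention the keyword.
    return chunk if question.lower() in chunk.lower() else ""


def fake_reduce(question, partial_answers):
    # Stand-in for an AI call that would summarize the partial answers.
    return " | ".join(partial_answers)


document = "Writing needs practice.\n\nCooking needs salt.\n\nGood writing needs editing."
result = map_reduce(document, "writing", split_paragraphs, fake_map, fake_reduce)
print(result)
```

Because every chunk is processed independently, no single call has to fit the entire document into the prompt.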

In the next example, I write a prompt and pass a text splitter to chunk the document. As a result, I get a list of quotes.

from langchain import PromptTemplate
from langchain.chains.mapreduce import MapReduceChain


def find_answer(question, article):
    prompt_template = PromptTemplate.from_template(
      """
      Use the following article to answer the user's question.
      Answer by returning bullet points with relevant quotes from the article.
      Start with bullet points. Don't include any header. Don't include a footer either.

      Question: {question}

      Article:
      ---
      {input_text}
      ---
      """
    )
    chain = MapReduceChain.from_params(
      llm=llm,
      prompt=prompt_template,
      text_splitter=RecursiveCharacterTextSplitter(chunk_size=4000),
      # required only if the prompt uses multiple variables:
      reduce_chain_kwargs={"document_variable_name": "input_text"},
      combine_chain_kwargs={"document_variable_name": "input_text"}
    )
    return chain.run(input_text=article, question=question)

partial_responses = [find_answer("What are David Perell's hints on writing?", article.page_content) for article in found_full_documents]

In the last step, I will join the partial responses and pass them to AI again to get the final answer.

from langchain import LLMChain


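# the same question that I used to produce the partial responses
question = "What are David Perell's hints on writing?"
input_text = "\n\n".join(partial_responses)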

prompt_template = PromptTemplate.from_template(
    """
    You will receive a question, and bullet points with quotes answering the question. Summarize them in 1-2 paragraphs.

    Question: {question}

    Answers: {answers}
    """
)
chain = LLMChain(prompt=prompt_template, llm=llm)
final_answer = chain.run(question=question, answers=input_text)

I got a pretty good summary of the three essays:

David Perell's hints on writing emphasize the importance of imitation and innovation. He encourages writers to imitate the work of those who have come before them, while also striving for originality. He suggests that writers should look to other fields for inspiration, and to consume art intentionally. Perell also encourages writers to embrace imitative learning for skills that are hard to put into words, and to listen for resistance in the imitation process to find their authentic artistic voice. He also suggests that writers should read a lot of good writing to hone their intuition for what quality writing feels like, and to make sure they are making useful contributions. Finally, Perell encourages writers to demand quality from themselves, to write with passion, and to promote their work.
