Generate questions and answers from any document using AI

Have you ever read the terms of service? It would be great to know what is in the document without reading the entire text. Let’s use AI to generate questions and answers from the HuggingFace terms of service.

Of course, it’s a terrible idea to read an automatically generated summary instead of the entire legal document, so please read the Terms of Service, Privacy Policy, Content Policy, and Code of Conduct if you are going to use HuggingFace. But we can still use the terms of service as an example for a tutorial.

Terms of service are long documents. If the model has a large enough context window, we can pass the entire document in a single prompt. But even if the HuggingFace terms of service fit in the prompt, I bet Oracle’s or Microsoft’s terms of service would not.

Therefore, the first problem is splitting the text into smaller chunks. We shouldn’t break the document at arbitrary positions because we would cut sentences, or even words, in half. Ideally, we split the text at the end of a paragraph or a sentence. If that’s not possible, we break it between words.

The requirement sounds tricky, but LangChain has a built-in text splitter for this purpose: RecursiveCharacterTextSplitter. It tries a list of separators in order. With the default settings, it splits at paragraph breaks first, then at line breaks, then at spaces, so a chunk is cut in the middle of a word only as a last resort.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default, not in tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
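
To make the fallback order concrete, here is a minimal, library-free sketch of the same idea. It is not LangChain's implementation: the function name recursive_split is made up for this example, and the sketch ignores chunk overlap. It only shows how a splitter can prefer paragraph breaks, then line breaks, then spaces, and cut inside a word only when nothing else works.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper for illustration; coarse separators are tried first.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or only the empty one): cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], tuple(separators[1:])
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
paragraph_chunks = recursive_split(sample, 35)
word_chunks = recursive_split("one two three four", 9)
```

In this sketch, the sample text is split at paragraph breaks because every paragraph fits within the limit, while the second call has to fall back to splitting between words.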

Let’s assume we have the terms of service already loaded into the content variable. We can split the content into chunks using the splitter:

chunks = splitter.split_text(content)

Now, we can pass the chunks to the DoctranQATransformer. However, the transformer expects LangChain Document objects, not plain strings, so first we wrap each chunk in a Document:

from langchain.schema import Document

documents = [Document(page_content=chunk) for chunk in chunks]

We are ready to generate the Q&A. The DoctranQATransformer relies on the doctran package, which has to be installed separately. We need to pass the OpenAI API key and the model name to the transformer:

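from langchain.document_transformers import DoctranQATransformer

# Replace 'sk-...' with your own key; don't commit real keys to source control.
qa_transformer = DoctranQATransformer(openai_api_key='sk-...', openai_api_model='gpt-3.5-turbo')
# Top-level await works in a notebook; in a script, call this from an async function.
qa_documents = await qa_transformer.atransform_documents(documents)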
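
The top-level await in the snippet above works in a Jupyter notebook, where an event loop is already running. In a regular Python script, one option is to wrap the call in a coroutine and start it with asyncio.run. The sketch below shows that pattern. FakeQATransformer is a made-up stand-in so the example runs without an API key; in real code, you would pass the DoctranQATransformer instance instead.

```python
import asyncio


class FakeQATransformer:
    """Hypothetical stand-in for DoctranQATransformer (no network calls)."""

    async def atransform_documents(self, documents):
        # A real transformer would call the OpenAI API here.
        await asyncio.sleep(0)
        return list(documents)


async def generate_qa(transformer, documents):
    """Await the asynchronous transformation inside a coroutine."""
    return await transformer.atransform_documents(documents)


def run_generate_qa(transformer, documents):
    """Synchronous entry point for scripts: starts an event loop and waits."""
    return asyncio.run(generate_qa(transformer, documents))


result = run_generate_qa(FakeQATransformer(), ["chunk one", "chunk two"])
```

Note that asyncio.run raises an error when called from code that already runs inside an event loop, so this wrapper is meant for scripts, not for notebooks.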

The transformer preserves the given document content but adds the generated Q&A pairs to each document’s metadata under the questions_and_answers key. We can combine the per-document lists into a single list like this:

qa = []
for document in qa_documents:
    qa.extend(document.metadata['questions_and_answers'])
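
Since the answers are generated automatically, it may help to keep track of which chunk every pair came from, so you can check the answer against the original text. Below is a minimal sketch of that idea. FakeDocument and collect_qa_with_sources are names invented for this example. The sketch also assumes that some chunks may have no questions_and_answers key at all, and it skips those instead of failing.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_qa_with_sources(documents):
    """Flatten the Q&A pairs and remember the chunk each pair was generated from."""
    collected = []
    for index, document in enumerate(documents):
        # Assumption: a chunk without generated pairs has no metadata key.
        pairs = document.metadata.get("questions_and_answers", [])
        for pair in pairs:
            collected.append({
                "chunk_index": index,
                "question": pair["question"],
                "answer": pair["answer"],
                "source": document.page_content,
            })
    return collected


# Made-up sample data, used only to exercise the helper.
sample_documents = [
    FakeDocument("Chunk without Q&A."),
    FakeDocument(
        "Chunk about fees.",
        {"questions_and_answers": [{"question": "Q1?", "answer": "A1."}]},
    ),
]
qa_with_sources = collect_qa_with_sources(sample_documents)
```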

Is there anything interesting in the HuggingFace terms of service? Let’s see. This one looks important:

{
    'question': 'Are the fees inclusive of taxes?',
    'answer': 'All fees are exclusive of any applicable taxes, which you are solely responsible to pay. '
},
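
With dozens of generated pairs, scanning the list by eye gets tedious. A simple keyword filter is one way to find the entries you care about. The sketch below assumes the qa list has the shape shown above (dictionaries with question and answer keys). The helper names find_qa and format_qa are made up for this example, and the second entry in the sample list is a placeholder, not real output.

```python
def find_qa(qa, keyword):
    """Return the pairs that mention the keyword in the question or the answer."""
    needle = keyword.lower()
    return [
        pair for pair in qa
        if needle in pair["question"].lower() or needle in pair["answer"].lower()
    ]


def format_qa(pair):
    """Render a single pair as readable text, trimming stray whitespace."""
    return f"Q: {pair['question'].strip()}\nA: {pair['answer'].strip()}"


qa = [
    {
        "question": "Are the fees inclusive of taxes?",
        "answer": "All fees are exclusive of any applicable taxes, which you are solely responsible to pay. ",
    },
    # Placeholder entry, added only so the filter has something to exclude.
    {"question": "Placeholder question?", "answer": "Placeholder answer."},
]

matches = find_qa(qa, "taxes")
for pair in matches:
    print(format_qa(pair))
```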
