---
title: "Generate questions and answers from any document using AI"
description: "How to use OpenAI GPT models, Langchain, and Doctran to generate questions and answers from long documents"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/use-ai-to-generate-questions-and-answers
---

Have you ever read the terms of service? It would be great to know what is in the document without reading the entire text. Let's use AI to generate questions and answers from the HuggingFace terms of service.

Of course, **it's a terrible idea to read an automatically generated summary instead of the entire legal document**, so please read the Terms of Service, Privacy Policy, Content Policy, and Code of Conduct if you are going to use HuggingFace. But we can still use the terms of service as an example for a tutorial.

Terms of service are long documents. If we have a model with a large enough prompt window, we can pass the entire document to the model. But even if the HuggingFace terms of service fit in the prompt, I bet Oracle's or Microsoft's would not.
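A quick way to sanity-check whether a document could fit in a model's context window is a rough character-based estimate. The ~4 characters per token figure below is a heuristic for English prose, not an exact count; for precise numbers, you would use a tokenizer such as `tiktoken`:

```python
def rough_token_count(text: str) -> int:
    # Heuristic: English prose averages roughly 4 characters per token.
    return len(text) // 4

# A hypothetical 60,000-character terms-of-service document:
content = "lorem ipsum " * 5000
print(rough_token_count(content))  # roughly 15,000 tokens
```

At ~15,000 tokens, such a document would already overflow the context window of many models, which is why we split it first.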

Therefore, the first problem is splitting the text into smaller chunks. We can't break the document at arbitrary points because we would cut words in half. We need to split the text at the end of a paragraph or, failing that, at the end of a sentence. If even that's not possible, we can break the text at a word boundary.

The requirement sounds tricky, but Langchain has a built-in text splitter for this purpose: `RecursiveCharacterTextSplitter`. It splits the document into chunks while trying to preserve entire paragraphs or, at least, sentences or words.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size is the maximum number of characters per chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
```
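Under the hood, the splitter tries a list of separators in order, from coarsest to finest: paragraph breaks first, then line breaks, then spaces, and finally individual characters. A simplified sketch of the idea (not the actual Langchain implementation, which also handles chunk overlap and configurable separators):

```python
def simple_recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    # If the text already fits, keep it whole.
    if len(text) <= chunk_size:
        return [text]
    # Try separators from coarsest (paragraphs) to finest (words).
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                    continue
                if current:
                    chunks.append(current)
                if len(part) <= chunk_size:
                    current = part
                else:
                    # A single part may still be too long; recurse on it.
                    current = ""
                    chunks.extend(simple_recursive_split(part, chunk_size, separators))
            if current:
                chunks.append(current)
            return chunks
    # No separator helped: fall back to hard character cuts.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For example, `simple_recursive_split("aaa bbb ccc", 7)` keeps whole words and returns `["aaa bbb", "ccc"]`, only resorting to mid-word cuts when a single word exceeds the chunk size.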

Let's assume we have the terms of service already loaded into the `content` variable. We can split the content into chunks using the splitter:

```python
chunks = splitter.split_text(content)
```

Now, we can pass the chunks to the `DoctranQATransformer`. However, the transformer requires Langchain `Document` objects, not plain text, so first we have to wrap each chunk in a `Document`:

```python
from langchain.schema import Document

documents = [Document(page_content=chunk) for chunk in chunks]
```

We are ready to generate Q&A. We need to pass the OpenAI API key and the model name to the transformer:

```python
from langchain.document_transformers import DoctranQATransformer

qa_transformer = DoctranQATransformer(openai_api_key='sk-...', openai_api_model='gpt-3.5-turbo')
qa_documents = await qa_transformer.atransform_documents(documents)
```
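Note that `atransform_documents` is a coroutine, so the `await` above only works in an async context (a Jupyter notebook supports top-level `await`). In a plain Python script, wrap the call with `asyncio.run`. The sketch below uses a stand-in coroutine in place of the transformer so the pattern is visible on its own; in the real script, you would await `qa_transformer.atransform_documents(documents)` inside `main`:

```python
import asyncio

# Stand-in for qa_transformer.atransform_documents: an async function
# that receives documents and returns transformed results.
async def transform_documents(documents):
    await asyncio.sleep(0)  # simulate the asynchronous API call
    return [f"qa for: {doc}" for doc in documents]

async def main():
    return await transform_documents(["chunk one", "chunk two"])

qa_documents = asyncio.run(main())
print(qa_documents)
```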

The transformer preserves the given document content but adds the Q&A pairs as document metadata. We can combine the per-document lists of questions and answers into a single list like this:

```python
qa = []
for document in qa_documents:
    qa.extend(document.metadata['questions_and_answers'])
```
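The same flattening can also be written as a single comprehension. Here the documents are minimal stand-ins with the same `metadata` shape that Doctran produces, so the snippet runs on its own:

```python
class Doc:
    # Minimal stand-in mirroring the metadata shape of a transformed Document.
    def __init__(self, pairs):
        self.metadata = {'questions_and_answers': pairs}

qa_documents = [
    Doc([{'question': 'Q1', 'answer': 'A1'}]),
    Doc([{'question': 'Q2', 'answer': 'A2'}]),
]

# Flatten all per-document Q&A lists into one list.
qa = [
    pair
    for document in qa_documents
    for pair in document.metadata['questions_and_answers']
]
print(qa)
```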

Is there anything interesting in the HuggingFace terms of service? Let's see. This one looks important:

```python
{
    'question': 'Are the fees inclusive of taxes?',
    'answer': 'All fees are exclusive of any applicable taxes, which you are solely responsible to pay. '
},
```

