Have you ever read a terms of service document from start to finish? It would be great to know what is in it without reading the entire text. Let’s use AI to generate questions and answers from the HuggingFace terms of service.

Of course, it’s a terrible idea to read an automatically generated summary instead of the entire legal document, so please read the Terms of Service, Privacy Policy, Content Policy, and Code of Conduct if you are going to use HuggingFace. But we can still use the terms of service as an example for a tutorial.

Terms of service are long documents. If a model’s prompt window is large enough, we can pass the entire document to the model. But even if the HuggingFace terms of service fit in the prompt, I bet Oracle’s or Microsoft’s terms of service would not.

Therefore, the first problem is splitting the text into smaller chunks. We can’t break the document at arbitrary points because we would cut words in half. Ideally, we split the text at the end of a paragraph or a sentence. If that’s not possible, we break it at the end of a word.

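The requirement sounds tricky, but Langchain has a built-in text splitter for this purpose. It’s called RecursiveCharacterTextSplitter. The splitter tries a list of separators in order and falls back to the next one only when a piece is still too long. With the default separators, it tries paragraph breaks first, then line breaks, then spaces, so paragraphs stay whole where possible and words are cut only as a last resort. Sentence boundaries are not in the default list, but they can be added through the separators argument. Note that chunk_size is measured in characters, not tokens.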

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=4000)
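To make the idea behind the splitter more concrete, here is a simplified, dependency-free sketch of recursive splitting. It is not Langchain’s implementation: the function name split_recursively is made up for this example, it ignores the chunk overlap the real splitter supports, and it drops the separator at chunk boundaries.

```python
def split_recursively(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Prefers the first separator (paragraph breaks) and falls back to the
    next ones (line breaks, then spaces) only for pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: cut at fixed positions as a last resort.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(separator):
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            # The part still fits into the chunk we are building.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(split_recursively(part, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
pieces = split_recursively(sample, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

The short paragraphs stay intact, and only the paragraph that exceeds the limit is broken up, at a space rather than in the middle of a word.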

Let’s assume we have the terms of service already loaded into the content variable. We can split the content into chunks using the splitter:

chunks = splitter.split_text(content)

Now, we can pass the chunks to the DoctranQATransformer. However, the transformer requires Langchain Document objects, not plain text, so first, we have to wrap every chunk in a Document:

from langchain.schema import Document

documents = [Document(page_content=chunk) for chunk in chunks]

We are ready to generate Q&A. We need to pass the OpenAI API key and the model name to the transformer:

from langchain.document_transformers import DoctranQATransformer

qa_transformer = DoctranQATransformer(openai_api_key='sk-...', openai_api_model='gpt-3.5-turbo')
qa_documents = await qa_transformer.atransform_documents(documents)
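The top-level await in the snippet above works where an event loop is already running, for example in a Jupyter notebook. In a regular Python script, the call has to be wrapped in a coroutine and started with asyncio.run. Below is a minimal sketch of that wrapping. To keep it runnable without an API key, it uses a hypothetical stand-in coroutine, fake_atransform_documents, in place of qa_transformer.atransform_documents.

```python
import asyncio


async def fake_atransform_documents(documents):
    # Hypothetical stand-in for qa_transformer.atransform_documents.
    # It makes no API call and returns the input unchanged.
    await asyncio.sleep(0)
    return list(documents)


async def main(documents):
    # In a real script, this line would await the transformer instead.
    return await fake_atransform_documents(documents)


# asyncio.run creates the event loop, runs main to completion, and closes the loop.
result = asyncio.run(main(["chunk one", "chunk two"]))
print(len(result))
```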

The transformer preserves the given document content and adds the generated Q&A to each document’s metadata. We can combine the per-document lists of questions into a single list like this:

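qa = []
for document in qa_documents:
    # Each document stores its generated pairs under this metadata key
    qa.extend(document.metadata['questions_and_answers'])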

Is there anything interesting in the HuggingFace terms of service? Let’s see. This one looks important:

{
    'question': 'Are the fees inclusive of taxes?',
    'answer': 'All fees are exclusive of any applicable taxes, which you are solely responsible to pay. '
}
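With many generated pairs, it helps to search them instead of reading the whole list. The sketch below assumes the combined list holds dictionaries with 'question' and 'answer' keys, as in the excerpt above. The helper name find_pairs is made up for this example, and the second sample pair is a placeholder, not real model output.

```python
def find_pairs(qa, keyword):
    """Return the Q&A pairs that mention keyword in the question or the answer."""
    needle = keyword.lower()
    return [
        pair for pair in qa
        if needle in pair['question'].lower() or needle in pair['answer'].lower()
    ]


# Sample data: the first pair is the one shown above,
# the second is a made-up placeholder.
qa = [
    {
        'question': 'Are the fees inclusive of taxes?',
        'answer': 'All fees are exclusive of any applicable taxes, which you are solely responsible to pay. ',
    },
    {'question': 'Placeholder question?', 'answer': 'Placeholder answer.'},
]

for pair in find_pairs(qa, 'taxes'):
    # strip() removes the trailing whitespace the model left in the answer
    print(pair['question'], '->', pair['answer'].strip())
```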
