Goodbye Web Forms, Hello Chat Messages

A Guide to Using OpenAI Functions and Langchain to Extract Structured Data from LLM App Conversations

Web forms play a fundamental role in many consumer and business applications. Consider software like Salesforce CRM, Expensify, Jira, or e-commerce platforms. These products use forms to gather user data and act on it, solving the problems they were designed for.

When it comes to AI chat applications, the story is different. Collecting needed data is more complicated because we’re working with unstructured text messages instead of structured forms.

This article outlines a method for extracting data from these unstructured chat messages using simple, developer-defined schemas. The method makes data extraction easy and helps direct the LLM to gather the data it still needs, all while adding little complexity to the app.

At a high level, the approach is as follows:

  1. Define model schemas using Pydantic that describe the data we need to extract from a conversation. In our example, these schemas model an e-commerce transaction with Person, OrderItem, and Address classes, and a parent Order class.
  2. Create a custom Langchain Memory class that takes as input our Pydantic Order class. When provided with a new human message, the custom Memory class will use an OpenAI Function and the JSON Schema representation of our Order class to extract the values described by this schema from the human's message.
  3. Define a template that we will use to direct the LLM to complete the e-commerce transaction. We will add a custom order_details input field to the template, into which we will inject the current state of the populated Order model. This helps guide the LLM through the data collection, ensuring it requests fields it does not yet have.
  4. As the conversation between the LLM agent and human progresses, we will merge newly extracted schema field values into a single instance of our Pydantic model class. 

I’m using Pydantic and Langchain to demonstrate this approach. It is entirely possible to do the same using other frameworks and languages.

💡 The source code for this article may be found in the Zep By Example repository.

Let's start with our schema base class, ExtractorSchema.

from __future__ import annotations

from abc import ABC

from pydantic import BaseModel, Field

class ExtractorSchema(BaseModel, ABC):
    """Base class for extractor models."""

    def recursive_update(self, model_new: ExtractorSchema | None) -> None:
        """Recursively update model values."""
        if model_new is None:
            return

        for name in self.__annotations__:
            new_field_value = getattr(model_new, name, None)
            old_field_value = getattr(self, name)

            if isinstance(old_field_value, ExtractorSchema):
                # Nested schema: merge its fields rather than overwrite the instance.
                if new_field_value is not None and new_field_value != {}:
                    old_field_value.recursive_update(new_field_value)
            else:
                # Scalar field: only overwrite with a non-empty value.
                if new_field_value is not None and new_field_value != "":
                    setattr(self, name, new_field_value)

ExtractorSchema extends Pydantic's BaseModel, which lets us define model Fields, optionally validate them, and output a JSON Schema representation of the model. We'll use this schema representation when making the OpenAI Function call.

The class has a single method, recursive_update, which recursively merges a new instance of the model (output from our OpenAI Function call) into the calling instance.

Below are two subclasses of ExtractorSchema: Order and Person. Order is the parent class, with fields holding the Person, OrderItem, and Address classes. We've defined and described the fields for each class. The LLM will use our descriptions, alongside the field types, to extract these values from the chat messages.

class Order(ExtractorSchema):
    # Person, OrderItem, and Address must be defined (or their forward
    # references resolved) before Order is instantiated.
    person: Person | None = Field(description="the person ordering the shoes")
    item: OrderItem | None = Field(description="the item being ordered")
    shipping_address: Address | None = Field(
        description="the person's shipping address"
    )

class Person(ExtractorSchema):
    first_name: str | None = Field(description="the human's first name", max_length=100)
    last_name: str | None = Field(description="the human's last name", max_length=100)
    email: str | None = Field(
        description="the human's email address",
        max_length=100,
    )
    phone: str | None = Field(description="the human's phone number", max_length=20)
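
The OrderItem and Address classes aren't shown here, but they follow the same pattern. Below is a minimal sketch of what they might look like; the field names are inferred from the populated model output later in this article, so treat the details as an assumption rather than the repository's exact definitions.

class OrderItem(ExtractorSchema):
    # Field names inferred from the populated model shown later in this article.
    style: str | None = Field(description="the style of the shoe", max_length=100)
    brand: str | None = Field(description="the brand of the shoe", max_length=100)
    color: str | None = Field(description="the color of the shoe", max_length=50)
    size: str | None = Field(description="the shoe size", max_length=10)
    quantity: str | None = Field(description="the number of pairs ordered", max_length=10)

class Address(ExtractorSchema):
    street: str | None = Field(description="the street address", max_length=100)
    city: str | None = Field(description="the city", max_length=100)
    state: str | None = Field(description="the state or province", max_length=100)
    zip: str | None = Field(description="the zip or postal code", max_length=20)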

Note how we've typed the fields as nullable (None, in Python). Pydantic is a validating parser and expects required field values to be present when instantiating a class; if they are missing, Pydantic will raise an exception. From a user experience standpoint, we don't want the LLM to ask for all the data at once, and inevitably humans will respond piecemeal. We therefore have to define the fields as optional.
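
To make the piecemeal merging concrete, here's a short sketch of recursive_update in action (the values are illustrative):

# Start with an empty Order and merge values in as they arrive.
order = Order()

# Turn 1: the human only gives their first name.
order.recursive_update(Order(person=Person(first_name="Jane")))

# Turn 2: a later message yields the last name; the first name is preserved.
order.recursive_update(Order(person=Person(last_name="Austin")))

print(order.dict(exclude_unset=True))
# {'person': {'first_name': 'Jane', 'last_name': 'Austin'}}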

Next, we instantiate a Langchain LLM instance using GPT-4-0613, which is fine-tuned for parsing JSON Schema and constructing JSON results. With this, we instantiate our custom Memory class (described below) and pass in our LLM and an instance of the Order schema class. The memory class will use this schema to parse the chat history.

from langchain.chat_models import ChatOpenAI

llm_extraction = ChatOpenAI(temperature=0, model="gpt-4-0613")

memory = SchemaExtractorMemory(
    model_schema=Order(),
    llm=llm_extraction,
    input_key="input",
    memory_key="chat_history",
)

I'm using GPT-4 here because GPT-3.5-Turbo does not reliably follow the schema instructions for schemas as complex as ours. Your mileage may vary.

I also create an LLM instance for the chat chain. Here, I can use GPT-3.5-Turbo to improve responsiveness and reduce costs. Our chat chain is a simple LLMChain to which we pass our prompt. We also pass in the memory instance we created above.

sales_template = """
You are a shoe sales rep and your job is to assist humans with completing the purchase of a shoe.

In order to close a shoe sale, you need to know:
- the style and size of the shoe
- the human's full name
- the human's email address
- the human's phone number
- the human's shipping address

IMPORTANT INSTRUCTIONS:
- You may only ask for one piece of information at a time. 
- Ensure that you collect all of the above information in order to close the sale. 
- Confirm the order details with the human before closing the sale.

This is what you already know about this order:
{order_details}

Here are the prior messages in this conversation:
{chat_history}

Human's response: {input}
"""

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm_chat = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")
llm_chain = LLMChain(
    llm=llm_chat,
    prompt=PromptTemplate.from_template(sales_template),
    memory=memory,
    verbose=True,
)

The prompt includes our shoe sales-specific instructions. Alongside the chat_history, we're also providing the current state of the populated schema in the order_details field. This allows the LLM to compare the current state with its instructions, ensuring that it moves the conversation toward populating all fields in the schema (and a closed sale).

For some applications, providing the chat_history field alongside the order_details field may be redundant, but in this case, removing the history would lose some of the conversation's nuance.

Our custom Langchain Memory class works as follows:

  • When instantiated, it creates a Langchain Pydantic extraction chain, passing our ExtractorSchema model to the chain. In our usage, we're passing an instance of the Order class above.
  • When save_context is called by our chat chain, the extraction chain is called with the human's most recent message. Whatever Order model values are extracted are then merged into a class variable of the Order model class. This allows us to populate model values as the chat progresses.

Note that Pydantic models, from which many Langchain classes (including Memory) are derived, do not support arbitrary instance variables, hence the use of a class variable here.

from typing import Any, Dict

from langchain.chains import create_extraction_chain_pydantic
from langchain.chains.base import Chain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

class SchemaExtractorMemory(ConversationBufferMemory):
    """Memory for extracting values from chat message inputs."""

    model: ExtractorSchema | None = None
    llm: ChatOpenAI | None = None
    chain: Chain | None = None

    def __init__(self, model_schema: ExtractorSchema, llm: ChatOpenAI, **kwargs: Any):
        super().__init__(**kwargs)

        self.model = model_schema
        self.llm = llm
        self.chain = create_extraction_chain_pydantic(
            pydantic_schema=model_schema, llm=llm
        )

    def _merge_model_values(
        self, model_values_new: ExtractorSchema, model_values_old: ExtractorSchema
    ) -> ExtractorSchema | None:
        """Merge new model values into old model values."""
        if model_values_old is not None and model_values_new is not None:
            model_values_old.recursive_update(model_values_new)

        return model_values_old

    def _get_first_model_value(
        self, model_values: list[ExtractorSchema] | ExtractorSchema
    ) -> ExtractorSchema | None:
        """Get first model value from list of model values."""
        if isinstance(model_values, list):  # the extraction chain may return a list of models
            return (
                None
                if model_values is None or len(model_values) == 0
                else model_values[0]
            )
        else:
            return model_values

    def _extract_model_values(self, input_str: str) -> None:
        """Extract values from inputs to be used in model."""
        model_values = self.chain.run(input_str)

        model_values = self._get_first_model_value(model_values)  # type: ignore

        self.model = self._merge_model_values(model_values, self.model)

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        """Save context from this conversation to buffer."""
        input_str, output_str = self._get_input_output(inputs, outputs)

        self._extract_model_values(input_str)

        self.chat_memory.add_user_message(input_str)
        self.chat_memory.add_ai_message(output_str)

I'm intentionally extending a Memory class here rather than using more complex Langchain chain classes to implement the extraction. Sequential Chains and the like can limit our flexibility when building chat apps, while a Memory class is plug-and-play with most Langchain chain classes.

We then call our chat chain as follows:

llm_chain.run(
    {
        "input": "hello, I'm Jane!",
        "order_details": memory.model.dict(exclude_unset=True),
    }
)
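
In a real app, this call sits inside a loop, with order_details refreshed from the memory on every turn. Here's a minimal, hypothetical chat-loop sketch:

# Hypothetical chat loop: the memory extracts values after each turn,
# so order_details is re-read from memory.model before every call.
while True:
    user_input = input("Human: ")
    if user_input.lower() in ("quit", "exit"):
        break
    response = llm_chain.run(
        {
            "input": user_input,
            "order_details": memory.model.dict(exclude_unset=True),
        }
    )
    print(f"AI: {response}")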

Note how I’m passing in values for the prompt’s order_details field. These are the model values as a Python dictionary. When populated, the prompt may look something like the following excerpt. As mentioned above, I’ve found that providing this context is an effective way to guide the LLM through the data collection process.

You are a shoe sales rep and your job is to assist humans with completing the
purchase of a shoe.


In order to close a shoe sale, you need to know:
<snip />


This is what you already know about this order:
{'person': {'first_name': 'Jane', 'last_name': '', 'email': '', 'phone': ''}, 'item': {'size': '', 'color': '', 'brand': '', 'quantity': '', 'style': ''}, 'shipping_address': {'street': '', 'city': '', 'state': '', 'zip': ''}}


Here are the prior messages in this conversation:
Human: hello, I'm Jane!
AI: Hello Jane! Thank you for reaching out. May I know which shoe style you are interested in?


Human's response: I'd like to buy a pair of Puma Suede Classics.

Let's take a closer look at a call to the OpenAI API. The following is an extract of the JSON payload sent to OpenAI in the call. We're telling the LLM that we have a function called information_extraction and that we want to pass the human's message, "hello, I'm Jane!", to this function.

We then define the arguments to this function in the parameters field of the JSON extract below. This field is populated with the Order class rendered as a JSON Schema. The LLM is instructed to return the parameters to this function call adhering to the JSON Schema we’ve provided. Note how the JSON Schema includes our field descriptions and types, guiding the LLM on what data we expect.
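
You can inspect this schema directly from the Pydantic model. A quick sketch, using the Pydantic v1 API (Langchain wraps the result in its own function definition, so what's sent over the wire differs slightly):

import json

# Render the Order model as JSON Schema, including field
# descriptions, types, and length constraints.
print(json.dumps(Order.schema(), indent=2))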

<snip>
   {
      "content": "Extract and save the relevant entities mentioned in the following passage together with their properties.\n\nPassage:\nhello, I'm Jane!\n",
      "role": "user"
    }
</snip>
<snip>
    "functions": [
      {
        "description": "Extracts the relevant information from the passage.",
        "name": "information_extraction",
        "parameters": {
          "properties": {
            "info": {
              "items": {
                "properties": {
                  
                  "person": {
                    "allOf": [
                      {
                        "properties": {
                          "email": {
                            "description": "the human's email address",
                            "maxLength": 100,
                            "title": "Email",
                            "type": "string"
                          },
                          "first_name": {
                            "description": "the human's first name",
                            "maxLength": 100,
                            "title": "First Name",
                            "type": "string"
                          },
</snip>

Thanks to Pydantic and Langchain’s Pydantic extraction chain, the above complexity is abstracted away. What is returned is an instance of our Order class populated with values the LLM has extracted.
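
If you want to experiment with extraction in isolation, outside of the Memory class, you can call Langchain's Pydantic extraction chain directly. A minimal sketch:

from langchain.chains import create_extraction_chain_pydantic
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-4-0613")
chain = create_extraction_chain_pydantic(pydantic_schema=Order, llm=llm)

# The chain returns a list of Order instances extracted from the passage.
results = chain.run("hello, I'm Jane!")
if results:
    print(results[0].dict(exclude_unset=True))
    # e.g. {'person': {'first_name': 'Jane'}}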

Here is a complete conversation, followed by the resulting populated model:

Human: hello, I'm Jane!
AI: Hello Jane! Thank you for reaching out. May I know which shoe style you are interested in?
Human: I'd like to buy a pair of Puma Suede Classics.
AI: Great choice! May I know the size of the Puma Suede Classics you would like to purchase?
Human: I'd prefer them to be black.
AI: Thank you for letting me know your color preference. Just to clarify, you would like to purchase a pair of black Puma Suede Classics. May I know your shoe size?
Human: Yes. I'm a size 9
AI: Thank you for providing your shoe size, Jane. Now, may I have your full name, please?
Human: Jane Austin
AI: Thank you for providing your full name, Jane Austin. May I have your email address, please?
Human: [email protected]
AI: Thank you for providing your email address, Jane. May I have your phone number, please?
Human: 415-555-1234
AI: Thank you for providing your phone number, Jane. Lastly, may I have your shipping address, please?
Human: 555 Main St, San Francisco, CA 94555
AI: Thank you for providing your shipping address, Jane. Just to confirm, your shipping address is 555 Main St, San Francisco, CA 94555. Is that correct?
Human: Yes, that's correct.
AI: Wonderful! I have all the necessary information to complete your purchase, Jane. You have selected a pair of black Puma Suede Classics in size 9. Your full name is Jane Austin, and your email address is [email protected]. The phone number you provided is 415-555-1234. Lastly, your shipping address is 555 Main St, San Francisco, CA 94555. 
{
    "person": {
        "first_name": "Jane",
        "last_name": "Austin",
        "email": "[email protected]",
        "phone": "415-555-1234"
    },
    "item": {
        "size": "9",
        "color": "black",
        "brand": "Puma",
        "quantity": "1",
        "style": "Suede Classics"
    },
    "shipping_address": {
        "street": "555 Main St",
        "city": "San Francisco",
        "state": "CA",
        "zip": "94555"
    }
}

The above is a simple and effective way to progressively gather data from a human to complete a task. There are, however, several improvements that would need to be made before putting this into production:

  • The extraction chain runs in the chat loop, with relatively slow GPT-4 calls increasing the latency of the LLM's response. This can frustrate the user and degrade the experience. Running extraction asynchronously and in parallel with the chat loop would be better.
  • We're not validating the data. Pydantic has powerful validation features that we're not using. We'd want to devise a mechanism to validate the extracted values and prompt the LLM to ask the human to correct them if validation fails (see the validator sketch after this list).
  • There's nothing stopping the LLM from closing the sale with incomplete data. Alongside the field validation above, it would be useful to have a mechanism that tells the LLM whether fields are still missing or whether everything is populated and the sales transaction can be closed.
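
To illustrate the validation point, here's a hedged sketch of what a Pydantic (v1) validator on the Person class might look like. The validator name and rule are assumptions for illustration; in practice, you'd catch the ValidationError during extraction and prompt the LLM to ask the human for a correction.

from pydantic import Field, validator

class Person(ExtractorSchema):
    first_name: str | None = Field(description="the human's first name", max_length=100)
    last_name: str | None = Field(description="the human's last name", max_length=100)
    email: str | None = Field(description="the human's email address", max_length=100)
    phone: str | None = Field(description="the human's phone number", max_length=20)

    # Illustrative check: reject values that don't look like an email address.
    @validator("email")
    def email_must_contain_at(cls, v: str | None) -> str | None:
        if v is not None and "@" not in v:
            raise ValueError("email must contain an '@'")
        return v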

Next steps

Visit Zep on GitHub!