Document field extractor node

Use the document field extractor node to extract specific fields from your documents.

Pre-requisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:

BASH

orchestrate server start -e <.env file path> -d

Note: To support document processing features, allocate at least 20 GB of RAM to your Docker engine when you install watsonx Orchestrate Developer Edition.

Note: To run the document field extractor, you must define the WATSONX_SPACE_ID, WATSONX_APIKEY, and WATSONX_PROJECT_ID credentials in your .env file. For more information on configuring the .env file, see Installing the watsonx Orchestrate Developer Edition.
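Before starting the server, it can help to confirm that the three credentials are actually present in the .env file. The following is a minimal sketch, not part of the ADK: the helper name missing_env_keys is hypothetical, and it assumes a simple KEY=value file format with optional comments and an optional "export " prefix.

```python
# Hypothetical helper (not part of the ADK): reports which of the
# credentials named in the note above are missing from .env-style text.
REQUIRED_KEYS = ("WATSONX_SPACE_ID", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_env_keys(env_text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in the given text."""
    defined = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        # Remove surrounding whitespace and simple quoting from the value.
        defined[key] = value.strip().strip("'\"")
    return [k for k in required if not defined.get(k)]
```

Calling missing_env_keys on the contents of your .env file returns an empty list when all three credentials are set to non-empty values.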

Configuring the document field extractor node in agentic workflows

Define the fields to extract. Create a class that defines the fields you want to extract. Each field must follow this structure:

Python

field: DocExtConfigField = Field(name="Field name", default=DocExtConfigField(name="Field name", field_name="field_name"))

Class example:

Python

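from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField


class Fields(BaseModel):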
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )
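In the examples on this page, each field_name is the lowercase, underscore-separated form of the display name ("Agreement Date" becomes "agreement_date"). The sketch below reproduces that pattern with a hypothetical helper, to_field_name. It is an observation about the examples, not a documented requirement of DocExtConfigField, and the helper is not part of the ADK.

```python
import re


def to_field_name(display_name: str) -> str:
    """Derive a snake_case identifier from a human-readable field name.

    Hypothetical helper: mirrors the naming pattern seen in the examples
    (for instance "Agreement Date" -> "agreement_date").
    """
    # Keep runs of letters and digits; treat everything else as a separator.
    words = re.findall(r"[A-Za-z0-9]+", display_name)
    return "_".join(word.lower() for word in words)
```

The result can be passed as the field_name argument when you build each DocExtConfigField, which keeps display names and identifiers consistent across a large schema.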

Configure the document extractor node

In your agentic workflow, include a call to the docext() method to extract fields from a document. This method accepts the following input arguments:

Parameter	Type	Required	Description
name	string	Yes	Unique identifier for the node.
llm	string	Yes	The LLM used for field extraction. The default value is `groq/openai/gpt-oss-120b`.
display_name	string	No	Display name for the node.
fields	object	Yes	The fields you want to extract.
description	string	No	Description of the node.
input_map	DataMap	No	Define input mappings using a structured collection of Assignment objects.
enable_hw	bool	No	Enables extraction of handwritten text when set to `True`.
min_confidence	float	No	The minimum acceptable confidence for an extracted field value.
review_fields	List[string]	No	The fields that require user review.
enable_review	bool	No	Enables or disables the human-in-the-loop feature. Set to `True` to activate it and `False` to deactivate. The default value is `False`.

Note: The min_confidence and review_fields settings control the human-in-the-loop feature. This feature works only when you run the flow from a chat session. If a field is extracted with a confidence lower than min_confidence and its name appears in review_fields, the agent opens a review window in the chat, where you can review and confirm the extracted values.
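The review rule in the note can be sketched as a small function. This is an illustration of the stated condition only: the function name fields_needing_review and the confidences dictionary (field name mapped to a confidence score) are hypothetical, because the structure of the node's actual output is not described here.

```python
def fields_needing_review(
    confidences: dict[str, float],
    min_confidence: float,
    review_fields: list[str],
) -> list[str]:
    """Return the fields that would be sent for human review.

    A field qualifies only when both conditions from the note hold:
    its name is listed in review_fields AND its confidence is below
    min_confidence. The input dictionary is an assumed shape.
    """
    watched = set(review_fields)
    return [
        name
        for name, confidence in confidences.items()
        if name in watched and confidence < min_confidence
    ]
```

With min_confidence set to 0.8 and review_fields set to ["buyer", "seller"], a seller value extracted at 0.40 would be flagged, while an agreement_date value at 0.30 would not, because agreement_date is not listed in review_fields.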

The input to a docext node is expected to be of type DocExtInput, from the module ibm_watsonx_orchestrate.flow_builder.types. The following example uses the docext node in an agentic workflow:

Python

from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField, DocumentProcessingCommonInput


class Fields(BaseModel):
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )
    
    seller: DocExtConfigField = Field(
        name="Seller",
        default=DocExtConfigField(
            name="Seller",
            field_name="seller"
        )
    )
    
    agreement_date: DocExtConfigField = Field(
        name="Agreement date",
        default=DocExtConfigField(
            name="Agreement Date",
            field_name="agreement_date",
            type="date"
        )
    )
    
    agreement_number: DocExtConfigField = Field(
        name="Agreement number",
        default=DocExtConfigField(
            name="Agreement Number",
            field_name="agreement_number",
            description="The identifier of this contract."
        )
    )
    
    contract_type: DocExtConfigField = Field(
        name="Contract type",
        default=DocExtConfigField(
            name="Contract Type",
            field_name="contract_type",
            type="string",
            description="The type of contract between the buyer and the seller."
        )
    )


@flow(
    name="custom_flow_docext_example",
    display_name="custom_flow_docext_example",
    description="Extraction of custom fields from a document, specified by the user.",
    input_schema=DocumentProcessingCommonInput
)
def build_docext_flow(aflow: Flow = None) -> Flow:
    # aflow.docext returns 2 objects: the document extractor node and the schema of the extracted values.
    # In this example, doc_ext_node is the node and is added to the flow.
    # _ExtractedValues is the output schema of doc_ext_node and can be used as the input schema of nodes downstream in the flow.

    doc_ext_node, _ExtractedValues = aflow.docext(
        name="contract_extractor",
        display_name="Extract fields from a contract",
        description="Extracts fields from an input contract file",
        llm="groq/openai/gpt-oss-120b",
        fields=Fields()
    )

    aflow.sequence(START, doc_ext_node, END)
    return aflow
