Document field extractor node (Public preview)

Use the document field extractor node to extract specific fields from your documents.

This feature is currently in public preview. Functionality and behavior may change in future updates.

Pre-requisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:

BASH

orchestrate server start -e <.env file path> -d

Note: Allocate at least 20 GB of RAM to your Docker engine when you install watsonx Orchestrate Developer Edition. Document processing features require this minimum.

Note: To run the document field extractor, you must define the WATSONX_SPACE_ID, WATSONX_APIKEY, and WATSONX_PROJECT_ID credentials in your .env file. For more information on configuring the .env file, see Installing the watsonx Orchestrate Developer Edition.
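Before starting the server, it can help to confirm that the .env file actually defines the three credentials listed above. The following is a minimal sketch that uses only the Python standard library. The helper names parse_env and missing_credentials are hypothetical and are not part of watsonx Orchestrate, and the parser assumes simple KEY=VALUE lines.

```python
# Hypothetical pre-flight check; not part of the watsonx Orchestrate CLI or SDK.
REQUIRED_KEYS = ("WATSONX_SPACE_ID", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(text: str) -> list[str]:
    """Return the required keys that are absent or empty in the .env text."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Placeholder values for illustration only.
sample = """
# example .env content
WATSONX_APIKEY=abc123
WATSONX_SPACE_ID=
"""
print(missing_credentials(sample))
# prints ['WATSONX_SPACE_ID', 'WATSONX_PROJECT_ID']
```

To check a real file, pass its contents to the helper, for example missing_credentials(open(".env").read()). An empty list means all three credentials are defined.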

Configuring document extractor node in agentic workflows

Define the fields to extract. Create a class that defines the fields you want to extract. Each field must follow this structure:

Python

field: DocExtConfigField = Field(name="Field name", default=DocExtConfigField(name="Field name", field_name="field_name"))

Class example:

Python

class Fields(BaseModel):
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )

Configure the document extract node

In your agentic workflow, include a call to the docext() method to extract fields from a document. This method accepts the following input arguments:

name (string, required): Unique identifier for the node.

llm (string, required): The LLM used for field extraction. The default value is watsonx/meta-llama/llama-3-2-90b-vision-instruct.

display_name (string): Display name for the node.

fields (object, required): The fields you want to extract.

description (string): Description of the node.

input_map (DataMap): Define input mappings using a structured collection of Assignment objects.

enable_hw (bool): Enable the handwritten feature by setting this to true.

min_confidence (float): The minimum acceptable confidence for an extracted field value.

review_fields (List[string]): The fields that require user review.

enable_review (bool): Enables or disables the human-in-the-loop feature. Set to True to activate it and False to deactivate. The default value is False.

field_extraction_method (string): Selects the Document Extractor runtime. The default value is classic, which uses the Unstructured Document Extractor. To use the Structured Document Extractor, set the value to layout.

Note: The min_confidence and review_fields settings control the human-in-the-loop feature. This feature works only when you run the flow from a chat session. If a field is extracted with a confidence lower than min_confidence and its name appears in review_fields, the agent opens a review window in the chat, where you can review and confirm the extracted values.
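The review rule in the note above can be illustrated with a small standalone sketch. This is not the extractor's actual implementation. The function name fields_needing_review and the confidence values are hypothetical, and the sketch assumes that setting enable_review to False disables review entirely.

```python
# Illustrative model of the review rule; not the watsonx Orchestrate implementation.
def fields_needing_review(
    confidences: dict[str, float],
    min_confidence: float,
    review_fields: list[str],
    enable_review: bool = True,
) -> list[str]:
    """Return field names that would be sent to the user for review.

    A field qualifies only when both conditions hold:
    its confidence is below min_confidence, and its name is in review_fields.
    """
    if not enable_review:  # assumption: the feature is off unless enabled
        return []
    watched = set(review_fields)
    return [
        name
        for name, confidence in confidences.items()
        if name in watched and confidence < min_confidence
    ]


# Hypothetical extraction confidences for the contract example.
confidences = {"buyer": 0.95, "seller": 0.42, "agreement_date": 0.30}
print(fields_needing_review(confidences, 0.8, ["buyer", "seller"]))
# prints ['seller']
```

Here agreement_date has the lowest confidence but is not flagged, because it is not listed in review_fields. The buyer field is listed but its confidence is above the threshold.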

The input to a docext node is expected to be of type DocExtInput, from the module ibm_watsonx_orchestrate.flow_builder.types. The following example shows the docext node in an agentic workflow:

Python

from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField, DocumentProcessingCommonInput


class Fields(BaseModel):
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )
    
    seller: DocExtConfigField = Field(
        name="Seller",
        default=DocExtConfigField(
            name="Seller",
            field_name="seller"
        )
    )
    
    agreement_date: DocExtConfigField = Field(
        name="Agreement date",
        default=DocExtConfigField(
            name="Agreement Date",
            field_name="agreement_date",
            type="date"
        )
    )
    
    agreement_number: DocExtConfigField = Field(
        name="Agreement number",
        default=DocExtConfigField(
            name="Agreement Number",
            field_name="agreement_number",
            description="The identifier of this contract."
        )
    )
    
    contract_type: DocExtConfigField = Field(
        name="Contract type",
        default=DocExtConfigField(
            name="Contract Type",
            field_name="contract_type",
            type="string",
            description="The type of contract between the buyer and the seller."
        )
    )


@flow(
    name="custom_flow_docext_example",
    display_name="custom_flow_docext_example",
    description="Extraction of custom fields from a document, specified by the user.",
    input_schema=DocumentProcessingCommonInput
)
def build_docext_flow(aflow: Flow = None) -> Flow:
    # aflow.docext returns 2 objects: the document extractor node and the schema of the extracted values.
    # In this example, doc_ext_node is the node and is added to the flow.
    # _ExtractedValues is the output schema of doc_ext_node and can be used as the input schema of nodes downstream in the flow.

    doc_ext_node, _ExtractedValues = aflow.docext(
        name="contract_extractor",
        display_name="Extract fields from a contract",
        description="Extracts fields from an input contract file",
        llm="watsonx/meta-llama/llama-3-2-90b-vision-instruct",
        fields=Fields()
    )

    aflow.sequence(START, doc_ext_node, END)
    return aflow
