Use the document field extractor node to extract specific fields from your documents.
This feature is currently in public preview. Functionality and behavior may change in future updates.

Prerequisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:
BASH
orchestrate server start -e <.env file path> -d
Note: To support document processing features, allocate at least 20 GB of RAM to your Docker engine when you install watsonx Orchestrate Developer Edition.
Note: To run the document field extractor, you must define the WATSONX_SPACE_ID, WATSONX_APIKEY, and WATSONX_PROJECT_ID credentials in your .env file. For more information on configuring the .env file, see Installing the watsonx Orchestrate Developer Edition.
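As a quick sanity check before starting the server, the following sketch scans the text of a .env file and reports which of the three required credentials are missing or empty. The helper name `missing_credentials` and the simple KEY=VALUE parsing are assumptions made for illustration; they are not part of the watsonx Orchestrate tooling.

```python
# Hypothetical helper: not part of watsonx Orchestrate.
REQUIRED_KEYS = ("WATSONX_SPACE_ID", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_credentials(env_text: str) -> list[str]:
    """Return the required keys that are absent or empty in .env-style text."""
    defined = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        defined[key.strip()] = value.strip().strip('"').strip("'")
    return [key for key in REQUIRED_KEYS if not defined.get(key)]


if __name__ == "__main__":
    # Placeholder values only; replace with the contents of your own .env file.
    sample = "WATSONX_APIKEY=placeholder-key\nWATSONX_SPACE_ID=\n"
    print(missing_credentials(sample))
```

An empty list means all three credentials are defined; any returned name should be added to the .env file before running the server start command.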

Configuring document extractor node in agentic workflows

  1. Define the fields to extract. Create a class that defines the fields you want to extract. Each field must follow this structure:
    Python
    field: DocExtConfigField = Field(name="Field name", default=DocExtConfigField(name="Field name", field_name="field_name"))
    
    Class example:
    Python
    class Fields(BaseModel):
        """
        Configuration schema for document extraction fields.
        
        Defines the fields to be extracted from contract documents, including
        their names, types, and descriptions. Each field is configured with
        a DocExtConfigField that specifies how the document extractor should
        identify and extract the information.
        
        In this example, we define a number of custom fields for a Contract or 
        Agreement document:
            buyer: The purchasing party in the contract
            seller: The selling party in the contract
            agreement_date: The date when the agreement was signed (date type)
            agreement_number: Unique identifier for the contract
            contract_type: Classification of the contract 
        """
        buyer: DocExtConfigField = Field(
            name="Buyer",
            default=DocExtConfigField(
                name="Buyer",
                field_name="buyer"
            )
        )
    
  2. Configure the document extractor node
In your agentic workflow, include a call to the docext() method to extract fields from a document. This method accepts the following input arguments:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Unique identifier for the node. |
| llm | string | Yes | The LLM used for field extraction. The default value is watsonx/meta-llama/llama-3-2-90b-vision-instruct. |
| display_name | string | No | Display name for the node. |
| fields | object | Yes | The fields you want to extract. |
| description | string | No | Description of the node. |
| input_map | DataMap | No | Define input mappings using a structured collection of Assignment objects. |
| enable_hw | bool | No | Enable the handwritten feature by setting this to true. |
| min_confidence | float | No | The minimum acceptable confidence for an extracted field value. |
| review_fields | List[string] | No | The fields that require user review. |
| enable_review | bool | No | Enables or disables the human-in-the-loop feature. Set to True to activate it and False to deactivate. The default value is False. |
Note: The min_confidence and review_fields settings control the human-in-the-loop feature. This feature works only when you run the flow from a chat session. If a field is extracted with a confidence lower than min_confidence and its name appears in review_fields, the agent opens a review window in the chat, where you can review and confirm the extracted values.
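The review rule in the note can be sketched as a small pure function. This is only an illustration of the stated condition (confidence below min_confidence and name listed in review_fields), not the library's implementation. The function name `fields_needing_review`, the dictionary of per-field confidences, and the use of field names such as "buyer" as keys are assumptions; the enable_review flag and the chat-session requirement are not modeled.

```python
# Illustrative sketch of the review rule; not the watsonx Orchestrate implementation.
def fields_needing_review(
    confidences: dict[str, float],
    min_confidence: float,
    review_fields: list[str],
) -> list[str]:
    """Return the fields that would be flagged for human review.

    A field is flagged only when BOTH conditions hold:
      1. its extraction confidence is lower than min_confidence, and
      2. its name appears in review_fields.
    """
    return [
        name
        for name, confidence in confidences.items()
        if confidence < min_confidence and name in review_fields
    ]


if __name__ == "__main__":
    # Hypothetical confidences for three extracted fields.
    confidences = {"buyer": 0.95, "seller": 0.40, "agreement_date": 0.30}
    print(fields_needing_review(confidences, 0.8, ["buyer", "seller"]))
```

In this example only seller is flagged: buyer is listed for review but meets the confidence threshold, and agreement_date falls below the threshold but is not listed in review_fields.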
The input to a docext node is expected to be of type DocExtInput, from the module ibm_watsonx_orchestrate.flow_builder.types. Example use of the docext node in an agentic workflow:
Python
from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField, DocumentProcessingCommonInput


class Fields(BaseModel):
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )
    
    seller: DocExtConfigField = Field(
        name="Seller",
        default=DocExtConfigField(
            name="Seller",
            field_name="seller"
        )
    )
    
    agreement_date: DocExtConfigField = Field(
        name="Agreement date",
        default=DocExtConfigField(
            name="Agreement Date",
            field_name="agreement_date",
            type="date"
        )
    )
    
    agreement_number: DocExtConfigField = Field(
        name="Agreement number",
        default=DocExtConfigField(
            name="Agreement Number",
            field_name="agreement_number",
            description="The identifier of this contract."
        )
    )
    
    contract_type: DocExtConfigField = Field(
        name="Contract type",
        default=DocExtConfigField(
            name="Contract Type",
            field_name="contract_type",
            type="string",
            description="The type of contract between the buyer and the seller."
        )
    )


@flow(
    name="custom_flow_docext_example",
    display_name="custom_flow_docext_example",
    description="Extraction of custom fields from a document, specified by the user.",
    input_schema=DocumentProcessingCommonInput
)
def build_docext_flow(aflow: Flow = None) -> Flow:
    # aflow.docext returns 2 objects: the document extractor node and the schema of the extracted values.
    # In this example, doc_ext_node is the node and is added to the flow.
    # _ExtractedValues is the output schema of doc_ext_node and can be used as the input schema of nodes downstream in the flow.

    doc_ext_node, _ExtractedValues = aflow.docext(
        name="contract_extractor",
        display_name="Extract fields from a contract",
        description="Extracts fields from an input contract file",
        llm="watsonx/meta-llama/llama-3-2-90b-vision-instruct",
        fields=Fields()
    )

    aflow.sequence(START, doc_ext_node, END)
    return aflow