Use the document field extractor node to extract specific fields from your documents.
This feature is currently in public preview. Functionality and behavior may change in future updates.

Pre-requisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:
BASH
orchestrate server start -e <.env file path> -d
Note: To support document processing features, allocate at least 20 GB of RAM to your Docker engine when you install watsonx Orchestrate Developer Edition.
Note: To run the document field extractor, you must define the WATSONX_SPACE_ID, WATSONX_APIKEY, and WATSONX_PROJECT_ID credentials in your .env file. For more information on configuring the .env file, see Installing the watsonx Orchestrate Developer Edition.
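As a quick sanity check before starting the server, the following minimal sketch reports which of the three required credentials are missing or empty in .env-style text. The helper name `missing_env_keys` is hypothetical and not part of watsonx Orchestrate, and the parser assumes simple `KEY=VALUE` lines rather than the full .env syntax.

```python
# Hypothetical helper, not part of the watsonx Orchestrate CLI or SDK.
# Assumes simple KEY=VALUE lines; blank lines and "#" comments are skipped.

REQUIRED_KEYS = ("WATSONX_SPACE_ID", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_env_keys(env_text: str) -> list[str]:
    """Return the required keys that are absent or empty in the given text."""
    defined = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        defined[key.strip()] = value.strip().strip('"').strip("'")
    return [key for key in REQUIRED_KEYS if not defined.get(key)]


# Placeholder values for illustration only.
sample = "WATSONX_SPACE_ID=my-space\nWATSONX_APIKEY=\n# project ID not set yet\n"
print(missing_env_keys(sample))
```

To check a real file, pass its contents, for example `missing_env_keys(open(".env").read())`.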

Configuring document extractor node in agentic workflows

  1. Define the fields to extract
     Create a class that defines the fields you want to extract. Each field must follow this structure:
    Python
    field: DocExtConfigField = Field(name="Field name", default=DocExtConfigField(name="Field name", field_name="field_name"))
    
    Class example:
    Python
    class Fields(BaseModel):
        buyer: DocExtConfigField = Field(name="Buyer", default=DocExtConfigField(name="Buyer", field_name="buyer"))
        seller: DocExtConfigField = Field(name="Seller", default=DocExtConfigField(name="Seller", field_name="seller"))
        agreement_date: DocExtConfigField = Field(name="Agreement Date", default=DocExtConfigField(name="Agreement Date", field_name="agreement_date", type="date"))
    
  2. Configure the document extract node
In your agentic workflow, include a call to the docext() method to extract fields from a document. This method accepts the following input arguments:
Parameter | Type | Required | Description
name | string | Yes | Unique identifier for the node.
llm | string | Yes | The LLM used for field extraction.
display_name | string | No | Display name for the node.
fields | object | Yes | The fields you want to extract.
description | string | No | Description of the node.
input_map | DataMap | No | Define input mappings using a structured collection of Assignment objects.
enable_hw | bool | No | Set to True to enable the handwritten feature.
min_confidence | float | No | The minimum acceptable confidence for an extracted field value.
review_fields | List[string] | No | The fields that require user review.
enable_review | bool | No | Enables or disables the human-in-the-loop feature. Set to True to activate it and False to deactivate it. The default value is False.
Note: The min_confidence and review_fields settings control the human-in-the-loop feature. This feature works only when you run the flow from a chat session. If a field is extracted with a confidence lower than min_confidence and its name appears in review_fields, the agent opens a review window in the chat, where you can review and confirm the extracted values.
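The review rule in the note can be illustrated with a small sketch. This is not the library's implementation. The function name `fields_needing_review` and the shape of the `confidences` mapping are assumptions made for illustration only.

```python
# Illustrative only: mirrors the rule described in the note, not the actual
# watsonx Orchestrate code. `confidences` is a hypothetical mapping from a
# field name to the confidence of its extracted value.

def fields_needing_review(
    confidences: dict[str, float],
    min_confidence: float,
    review_fields: list[str],
) -> list[str]:
    """Return fields whose confidence is below the threshold and that are listed for review."""
    return [
        name
        for name, confidence in confidences.items()
        if name in review_fields and confidence < min_confidence
    ]


confidences = {"buyer": 0.95, "seller": 0.60, "agreement_date": 0.40}
flagged = fields_needing_review(
    confidences, min_confidence=0.8, review_fields=["buyer", "seller"]
)
print(flagged)  # ['seller']
```

In this sketch, agreement_date has low confidence but is not flagged because it is not listed in review_fields, and buyer is listed but exceeds the threshold.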
The input to a docext node is expected to be of type DocExtInput, from the module ibm_watsonx_orchestrate.flow_builder.types. Example use of the docext node in an agentic workflow:
Python
from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField, DocumentProcessingCommonInput


class Fields(BaseModel):
    buyer: DocExtConfigField = Field(name="Buyer", default=DocExtConfigField(name="Buyer", field_name="buyer"))
    seller: DocExtConfigField = Field(name="Seller", default=DocExtConfigField(name="Seller", field_name="seller"))
    agreement_date: DocExtConfigField = Field(name="Agreement Date", default=DocExtConfigField(name="Agreement Date", field_name="agreement_date", type="date"))

class Illinoise(BaseModel):
    social_security_number: DocExtConfigField = Field(name="Social Security Number", default=DocExtConfigField(name="Social Security Number", field_name="social_security_number"))
    driver_license_number: DocExtConfigField = Field(name="Driver's License Number", default=DocExtConfigField(name="Driver's License Number", field_name="driver_license_number"))


@flow(
    name="custom_flow_docext_example",
    display_name="custom_flow_docext_example",
    description="Extraction of custom fields from a document, specified by the user.",
    input_schema=DocumentProcessingCommonInput
)
def build_docext_flow(aflow: Flow = None) -> Flow:
    # aflow.docext returns two things:
    # - doc_ext_node, the node to add to aflow
    # - ExtractedValues, the output schema of aflow.docext, which can be passed to other nodes as their input schema

    doc_ext_node, ExtractedValues = aflow.docext(
        name="contract_extractor",
        display_name="Extract fields from a contract",
        description="Extracts fields from an input contract file",
        llm="watsonx/meta-llama/llama-3-2-90b-vision-instruct",
        fields=Illinoise(),
        enable_hw=True
    )

    aflow.sequence(START, doc_ext_node, END)
    return aflow
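
The example above does not use the human-in-the-loop parameters. As a rough sketch, the keyword arguments below use only parameter names from the table earlier in this section. The threshold value and the assumption that review_fields takes field_name values are illustrative and are not confirmed by this page.

```python
# Sketch of keyword arguments for aflow.docext() with review enabled.
# Parameter names come from the parameter table; the values are illustrative.
review_kwargs = {
    "name": "contract_extractor",
    "llm": "watsonx/meta-llama/llama-3-2-90b-vision-instruct",
    "enable_review": True,                 # human-in-the-loop is off by default
    "min_confidence": 0.8,                 # assumed threshold, chosen for illustration
    "review_fields": ["buyer", "seller"],  # assumption: entries are field_name values
}

# Inside a flow builder function such as build_docext_flow, the call might look like:
# doc_ext_node, ExtractedValues = aflow.docext(fields=Fields(), **review_kwargs)
```

As stated in the note on min_confidence and review_fields, the review window appears only when the flow runs from a chat session.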