# Document text extractor node

Use the document text extractor node to extract text from your documents.

## Pre-requisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:

```bash BASH theme={null}
orchestrate server start -e <.env file path> -d
```

<Note>
  Allocate at least 20 GB of RAM to your Docker engine when you install watsonx Orchestrate Developer Edition. Document processing features need this much memory to run.
</Note>

## Configuring document processing in flows

In your agentic workflow, call the `docproc()` method to process a document. The method accepts the following arguments:

| Parameter                | Type                    | Required | Description                                                                                                                                                                                                                                                                                        |
| ------------------------ | ----------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| name                     | string                  | Yes      | Unique identifier for the node.                                                                                                                                                                                                                                                                    |
| task                     | string                  | Yes      | Specifies which information is extracted from the document upon processing; supported values are:<ul><li>`text_extraction`: Extracts plain text from documents.</li></ul>                                                                                                                          |
| display\_name            | string                  | No       | Display name for the node.                                                                                                                                                                                                                                                                         |
| description              | string                  | No       | Description of the node.                                                                                                                                                                                                                                                                           |
| input\_map               | DataMap                 | No       | Define input mappings using a structured collection of Assignment objects.                                                                                                                                                                                                                         |
| document\_structure      | bool                    | No       | Controls whether the output includes additional fields as part of the document assembly. Set to `true` to include these fields.                                                                                                                                                                    |
| kvp\_schemas             | list\[DocProcKVPSchema] | No       | The key-value pair schemas used for extraction.                                                                                                                                                                                                                                                    |
| enable\_hw               | bool                    | No       | Enable the handwritten feature by setting this to `true`.                                                                                                                                                                                                                                          |
| kvp\_model\_name         | string                  | No       | The LLM model used for key-value pair extraction. Default value is `mistral-small-3-1-24b-instruct`. For more information about supported KVP models, see [Compatible Vision-Language Models](https://pages.github.ibm.com/ai-foundation/watson_doc_understanding/current/library/vlm_supported/). |
| kvp\_force\_schema\_name | string                  | No       | The schema name used for KVP extraction. If not set or None, uses the default schema.                                                                                                                                                                                                              |
| kvp\_enable\_text\_hints | bool                    | No       | Enables text hints to assist with KVP extraction.                                                                                                                                                                                                                                                  |

The input to a `docproc` node uses the `DocProcInput` type from the `ibm_watsonx_orchestrate.flow_builder.types` module. You can optionally configure the `kvp_schemas` parameter to define key-value pair input schemas. For more information, see [Semantic Key-Value Pair (KVP) Extraction](#semantic-key-value-pair-kvp-extraction).
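
As a rough illustration of the shape of that input, the following sketch assembles a payload as a plain dictionary. The field names `document_ref`, `language`, and `kvp_schemas` are taken from the runtime invocation example on this page. The helper `build_docproc_payload` is hypothetical and not part of the ADK, and `DocProcInput` may define fields that are not shown here.

```python
from typing import Any, Optional


def build_docproc_payload(
    document_ref: str,
    language: str = "en",
    kvp_schemas: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble an input payload for a docproc flow (hypothetical helper)."""
    if not document_ref:
        raise ValueError("document_ref must not be empty")
    payload: dict[str, Any] = {"document_ref": document_ref, "language": language}
    # Add kvp_schemas only when the caller sets it. Leaving it out keeps the
    # default (null) value for the parameter.
    if kvp_schemas is not None:
        payload["kvp_schemas"] = kvp_schemas
    return payload


# Placeholder reference; replace it with the reference to your own document.
print(build_docproc_payload("<document reference>"))
```

A dictionary like this is what the runtime example passes to `invoke()`.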

Example use of the `docproc` node in an agentic workflow:

```py Python [expandable] theme={null}
from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)

from ibm_watsonx_orchestrate.flow_builder.types import DocProcInput


@flow(
    name="text_extraction_flow_example",
    display_name="text_extraction_flow_example",
    description="This flow consists of one node: a docproc node, which extracts text from the input document",
    input_schema=DocProcInput
)
def build_docproc_flow(aflow: Flow = None) -> Flow:
    doc_proc_node = aflow.docproc(
        name="text_extraction",
        display_name="text_extraction",
        description="Extract text out of a document's contents.",
        task="text_extraction",
    )

    aflow.sequence(START, doc_proc_node, END)
    return aflow
```

After the node runs, you receive a URL pointing to a file that contains the extracted text. If you configure key-value pair (KVP) extraction, the file also includes the extracted KVPs.
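
One way to read the result is to download the file behind that URL. The sketch below uses only the Python standard library. It assumes that the URL is directly reachable without extra authentication headers and that the file is UTF-8 encoded; neither is confirmed by this page. The function name `fetch_extracted_text` is hypothetical.

```python
from urllib.request import urlopen


def fetch_extracted_text(url: str, timeout: float = 30.0) -> str:
    """Download the file behind `url` and decode it as UTF-8 text.

    Assumption: the URL can be opened without additional credentials.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Pass the URL from the flow result to `fetch_extracted_text()`. If your deployment protects the file store, you would need to add the appropriate credentials to the request.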

### Semantic Key-Value Pair (KVP) Extraction

Use the `kvp_schemas` parameter in the text extraction task to extract semantic key-value pairs from input documents.

You can define this parameter in two places:

* **Node specification:** Set `kvp_schemas` in the node configuration. The node supports semantic KVP extraction in the following cases:

  * If you define `kvp_schemas` in the input, the node uses those schemas. If you pass an empty array, it falls back to default schemas.
  * If you don’t define `kvp_schemas` in the input but include them in the node specification, the node uses the specification-defined schemas. Again, if the array is empty, it defaults to the built-in schemas.

  Configure `kvp_schemas` as a list of schema objects. The following example shows how to define `kvp_schemas` in a node configuration.

  ```py Python [expandable] theme={null}
  from pydantic import BaseModel, Field
  from ibm_watsonx_orchestrate.flow_builder.flows import (
      Flow, flow, START, END
  )

  from ibm_watsonx_orchestrate.flow_builder.types import DocProcInput


  @flow(
      name="kvp_extraction_flow_example_for_api",
      display_name="kvp_extraction_flow_example_for_api",
      description="This flow consists of one node: a docproc node, which extracts kvps from the input document",
      input_schema=DocProcInput
  )
  def build_docproc_flow(aflow: Flow = None) -> Flow:
      doc_proc_node = aflow.docproc(
          name="kvp_extraction",
          display_name="kvp_extraction",
          description="Extract kvp out of a document's contents.",
          task="text_extraction",
          kvp_schemas=[{
              "document_type": "Invoice",
              "document_description": "An invoice is a standard document issued by a seller to a buyer, outlining products or services provided, quantities, prices, and payment terms.",
              "additional_prompt_instructions": "Extract values exactly as they appear in the document, especially numbers and codes.",
              "fields": {
                  "invoice_number": {
                      "description": "A unique identifier assigned by the vendor for this invoice.",
                      "example": "2023-AUS-987654",
                      "default": ""
                  },
                  "document_date": {
                      "description": "Date of the document.",
                      "example": "2025-07-05",
                      "default": ""
                  },
                  "vendor_name": {
                      "description": "Legal or trade name of the company issuing the invoice. Usually located in the header or footer, near the logo, or billing details.",
                      "example": "ABC Supply Company Pty Ltd",
                      "default": ""
                  },
                  "vendor_number": {
                      "description": "Internal identifier used by the buyer's system to refer to the vendor.",
                      "example": "VEND-1023",
                      "default": ""
                  }
              }
          }]
      )

      aflow.sequence(START, doc_proc_node, END)
      return aflow
  ```

* **Runtime input:** Set `kvp_schemas` in the input payload.

  The following example shows how to define `kvp_schemas` in the input payload.

  ```py Python [expandable] theme={null}
  import asyncio
  import logging
  import sys
  import json
  from pathlib import Path

  from examples.flow_builder.text_extraction.tools.text_extraction_flow import build_docproc_flow

  logger = logging.getLogger(__name__)


  def on_flow_end(result):
      """
      Callback function to be called when the flow is completed.
      """
      # flow_name is the module-level global that main() sets before invoking the flow.
      print(f"Custom Handler: flow `{flow_name}` completed with result: {result}")


  def on_flow_error(error):
      """
      Callback function to be called when the flow fails.
      """
      print(f"Custom Handler: flow `{flow_name}` failed: {error}")


  async def main(doc_ref: str, kvp_schema_path):
      '''A function demonstrating how to build a flow and save it to a file.'''
      my_flow_definition = await build_docproc_flow().compile_deploy()
      global flow_name
      flow_name = my_flow_definition.flow.spec.display_name
      generated_folder = f"{Path(__file__).resolve().parent}/generated"
      my_flow_definition.dump_spec(f"{generated_folder}/docproc_flow_spec.json")

      with open(kvp_schema_path, 'r') as file:
          # Load the JSON data from the file into a Python dictionary
          schema_json = json.load(file)

      global flow_run
      flow_run = await my_flow_definition.invoke(
          {
          "document_ref": doc_ref, 
          "language": "en",
          "kvp_schemas": [ schema_json ]
          }, 
          on_flow_end_handler=on_flow_end, on_flow_error_handler=on_flow_error, debug=True)


  if __name__ == "__main__":
      if len(sys.argv) != 3:
          logger.error(f"Usage: {sys.argv[0]} file_store_path kvp_schema_path")
      else:
          asyncio.run(main(sys.argv[1], sys.argv[2]))
  ```
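
The runtime example reads the schema from a JSON file. The sketch below shows one way to load such a file and run a few basic shape checks before passing it in `kvp_schemas`. The keys it checks are the ones that appear in the invoice schema example on this page. Treating them as required is an assumption of this sketch, and `load_kvp_schema` is a hypothetical helper, not an ADK function.

```python
import json
from pathlib import Path
from typing import Any

# Keys that appear in the example schema on this page. Treating them as
# required is an assumption made for this sketch.
EXPECTED_KEYS = ("document_type", "document_description", "fields")


def load_kvp_schema(path: str) -> dict[str, Any]:
    """Load one KVP schema from a JSON file and run basic shape checks."""
    schema = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(schema, dict):
        raise ValueError("schema file must contain a single JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in schema]
    if missing:
        raise ValueError(f"schema is missing keys: {missing}")
    if not isinstance(schema["fields"], dict):
        raise ValueError("'fields' must map field names to field definitions")
    return schema
```

The returned dictionary can then be wrapped in a list, as the runtime example does with `[ schema_json ]`.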

For both the node specification and the runtime input, `kvp_schemas` defaults to null. If you define the parameter in both places, the runtime input takes precedence and overrides the node specification. To use the predefined extraction schemas, pass an empty array (`kvp_schemas: []`).
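
The precedence rules can be summarized in a few lines of Python. This is only a sketch of the documented behavior, not the engine's actual implementation. The function name `resolve_kvp_schemas` and the returned labels are hypothetical, and the handling of the case where neither place defines `kvp_schemas` is an assumption.

```python
from typing import Any, Optional

Schemas = Optional[list[dict[str, Any]]]


def resolve_kvp_schemas(runtime: Schemas, node_spec: Schemas) -> tuple[str, Schemas]:
    """Return (source, schemas) following the precedence rules on this page."""
    # Runtime input wins whenever it is defined, even if it is an empty list.
    chosen = runtime if runtime is not None else node_spec
    if chosen is None:
        # Assumption: with nothing configured, no KVP schemas are in effect.
        return ("none", None)
    if len(chosen) == 0:
        # An empty array selects the predefined (built-in) schemas.
        return ("predefined", None)
    return ("runtime" if runtime is not None else "node_spec", chosen)


invoice = {"document_type": "Invoice", "fields": {}}
print(resolve_kvp_schemas(None, [invoice])[0])   # node_spec
print(resolve_kvp_schemas([], [invoice])[0])     # predefined
print(resolve_kvp_schemas([invoice], None)[0])   # runtime
```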

Use `kvp_force_schema_name` to specify the schema name for KVP extraction. Use `kvp_enable_text_hints` to enable or disable text hints during extraction. Configure both settings in the `docproc()` node.

```py Python [expandable] theme={null}
import logging

from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocProcInput

logger = logging.getLogger(__name__)

# NOTE: get_sample_invoice_kvp_schema() is not defined in this snippet. It is
# assumed to be a helper from your own project that returns the KVP schemas.


@flow(
    name="text_extraction_kvps_flow_example",
    display_name="Text Extraction with KVP Extraction Flow",
    description=(
        "This flow consists of one node: a docproc node, which extracts text "
        "and custom key-value pairs from the input document using a predefined schema"
    ),
    input_schema=DocProcInput,
)
def build_docproc_flow(aflow: Flow) -> Flow:
    """
    Build a text extraction flow with Key-Value Pair (KVP) extraction capabilities.

    This flow creates a document processing pipeline that:
    1. Extracts raw text content from input documents
    2. Identifies and extracts structured key-value pairs based on a predefined schema

    The KVP extraction uses a schema-driven approach to identify specific fields
    like company information, invoice numbers, dates, and line items from invoices.

    Runtime Parameter Override:
        All KVP-related parameters (kvp_schemas, kvp_model_name, kvp_force_schema_name,
        kvp_enable_text_hints) can be provided at runtime when invoking the flow.
        Runtime values will override the default values configured in this flow definition.
        This allows for dynamic configuration without modifying the flow code.

    Args:
        aflow: Flow builder instance provided by the @flow decorator.

    Returns:
        Flow: Configured text extraction flow with KVP extraction
            (START → docproc_node → END)

    Example:
        >>> flow = build_docproc_flow()
        >>> # Flow will extract text and KVPs from documents
    """
    # Validate Flow instance (defensive programming)
    assert aflow is not None, "Flow instance must be provided by the @flow decorator"

    try:
        # Get the KVP schema for invoice processing
        kvp_schemas = get_sample_invoice_kvp_schema()

        # Create document processing node configured for text and KVP extraction
        doc_proc_node = aflow.docproc(
            name="text_extraction_with_kvp_node",
            display_name="Text Extraction with KVP Node",
            description="Extracts raw text and structured key-value pairs from an input document",
            task="text_extraction",
            # KVP extraction parameters.
            # Note: All kvp_* parameters below can be overridden at runtime when invoking the flow.
            # Runtime values take precedence over these default configuration values.
            kvp_schemas=kvp_schemas,  # type: ignore[arg-type]
            # Optional: LLM used for KVP extraction. Omit it to use the default mistral-small model.
            kvp_model_name="watsonx/mistralai/mistral-small-3-1-24b-instruct-2503",
            # Optional: the "document_type" of the schema to force for KVP extraction.
            # If not specified, the engine tries to match the input document to the given schemas.
            kvp_force_schema_name="MyInvoice",
            # Optional: Enable/disable text hints for KVP extraction
            kvp_enable_text_hints=True,
        )

        # Connect nodes in sequence: START → doc_proc_node → END
        aflow.sequence(START, doc_proc_node, END)

        logger.info("Text extraction with KVP flow built successfully")
        return aflow

    except Exception as e:
        logger.error(f"Failed to build text extraction KVP flow: {e}", exc_info=True)
        raise
```
