
# Document field extractor node

Use the document field extractor node to extract specific fields from your documents.

## Prerequisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:

```bash BASH theme={null}
orchestrate server start -e <.env file path> -d
```

<Note>
  **Note:**
  When you install watsonx Orchestrate Developer Edition, allocate at least 20 GB of RAM to your Docker engine to support document processing features.
</Note>

## Limitations

The document field extractor has the following limits and restrictions:

| Area                              | Description                                                                                                                 |
| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| Maximum file size                 | 10 MB (except for Microsoft Excel files). <Note>**Note:** The maximum file size for Microsoft Excel files is 0.1 MB.</Note> |
| Maximum number of uploaded files  | 5 files                                                                                                                     |
| Accepted file types               | .doc, .docx, .jpe, .jpeg, .jpg, .pdf, .png, .ppt, .pptx, .tif, .tiff, and .xlsx                                            |
| Maximum number of pages           | 600 pages                                                                                                                   |
| Maximum number of images          | No limit                                                                                                                    |
| Maximum number of images per page | No limit                                                                                                                    |

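The size, count, and file-type limits in the table can be checked before you upload anything. The following is a minimal sketch of such a pre-flight check, not part of the watsonx Orchestrate API: the `check_upload` function and its constants are hypothetical names, and the sketch assumes that 1 MB means 1024 × 1024 bytes. It does not check the 600-page limit, because that requires parsing the document.

```python
from pathlib import Path

# Limits copied from the table above. The byte conversion is an assumption:
# the table does not say whether 1 MB is 1000 * 1000 or 1024 * 1024 bytes.
MAX_FILES = 5
MAX_FILE_BYTES = 10 * 1024 * 1024
MAX_EXCEL_BYTES = int(0.1 * 1024 * 1024)
ACCEPTED_EXTENSIONS = {
    ".doc", ".docx", ".jpe", ".jpeg", ".jpg", ".pdf",
    ".png", ".ppt", ".pptx", ".tif", ".tiff", ".xlsx",
}


def check_upload(files: dict[str, int]) -> list[str]:
    """Return a list of problems for a batch given as {file_name: size_in_bytes}."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size in files.items():
        ext = Path(name).suffix.lower()
        if ext not in ACCEPTED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type '{ext}'")
            continue
        # Microsoft Excel files have a much smaller size limit than other types.
        limit = MAX_EXCEL_BYTES if ext == ".xlsx" else MAX_FILE_BYTES
        if size > limit:
            problems.append(f"{name}: {size} bytes exceeds the {limit}-byte limit")
    return problems
```
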
<Note>
  **Note:**
  To run the document field extractor, you must define the `WATSONX_SPACE_ID`, `WATSONX_APIKEY`, and `WATSONX_PROJECT_ID` credentials in your `.env` file. For more information on configuring the `.env` file, see [Installing the watsonx Orchestrate Developer Edition](../../developer_edition/wxOde_setup).
</Note>
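
To catch a missing credential before you start the server, you can scan the `.env` file for the three variables named in the note above. The following is a minimal sketch that uses only the Python standard library. The `missing_credentials` helper is a hypothetical name, not part of the `orchestrate` CLI, and it assumes simple `KEY=VALUE` lines without `export` prefixes or multi-line values.

```python
# The three variables that the note above requires in the .env file.
REQUIRED_KEYS = ("WATSONX_SPACE_ID", "WATSONX_APIKEY", "WATSONX_PROJECT_ID")


def missing_credentials(env_text: str) -> list[str]:
    """Return the required keys that are absent or empty in .env-style text."""
    values = {}
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_credentials(open(".env").read())` returns an empty list when all three variables are defined with non-empty values.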

## Configuring the document field extractor node in agentic workflows

1. Define the fields to extract.
   Create a class that defines the fields you want to extract. Each field must follow this structure:

   ```py Python theme={null}
   field: DocExtConfigField = Field(name="Field name", default=DocExtConfigField(name="Field name", field_name="field_name"))
   ```

   Class example:

   ```py Python [expandable] theme={null}
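   from pydantic import BaseModel, Field
   from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField


   class Fields(BaseModel):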
       """
       Configuration schema for document extraction fields.
       
       Defines the fields to be extracted from contract documents, including
       their names, types, and descriptions. Each field is configured with
       a DocExtConfigField that specifies how the document extractor should
       identify and extract the information.
       
       This shortened example defines only the buyer field. A complete schema
       for a Contract or Agreement document could define all of these fields:
           buyer: The purchasing party in the contract
           seller: The selling party in the contract
           agreement_date: The date when the agreement was signed (date type)
           agreement_number: Unique identifier for the contract
           contract_type: Classification of the contract 
       """
       buyer: DocExtConfigField = Field(
           name="Buyer",
           default=DocExtConfigField(
               name="Buyer",
               field_name="buyer"
           )
       )
   ```

2. Configure the document field extractor node.

In your agentic workflow, call the `docext()` method to extract fields from a document. This method accepts the following input arguments:

| Parameter       | Type          | Required | Description                                                                                                                              |
| --------------- | ------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| name            | string        | Yes      | Unique identifier for the node.                                                                                                          |
| llm             | string        | Yes      | The LLM used for field extraction. The default value is `groq/openai/gpt-oss-120b`.                                                      |
| display\_name   | string        | No       | Display name for the node.                                                                                                               |
| fields          | object        | Yes      | The fields you want to extract.                                                                                                          |
| description     | string        | No       | Description of the node.                                                                                                                 |
| input\_map      | DataMap       | No       | Define input mappings using a structured collection of Assignment objects.                                                               |
| enable\_hw      | bool          | No       | Enable the handwritten feature by setting this to `True`.                                                                                |
| min\_confidence | float         | No       | The minimum acceptable confidence for an extracted field value.                                                                          |
| review\_fields  | List\[string] | No       | The fields that require user review.                                                                                                     |
| enable\_review  | bool          | No       | Enables or disables the human-in-the-loop feature. Set to `True` to activate it and `False` to deactivate. The default value is `False`. |

<Note>
  **Note:**

  The `min_confidence` and `review_fields` settings control the human-in-the-loop feature, which works only when you run the flow from a chat session.
  If a field is extracted with a confidence lower than `min_confidence` and its name appears in `review_fields`, the agent opens a review window in the chat, where you can review and confirm the extracted values.
</Note>
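
The review rule in the note above combines two conditions: the confidence is below `min_confidence`, and the field name is listed in `review_fields`. The following plain-Python sketch illustrates that rule only. The `fields_needing_review` function and the dictionary of per-field confidences are hypothetical and do not reflect how the node represents its results internally.

```python
def fields_needing_review(
    confidences: dict[str, float],
    min_confidence: float,
    review_fields: list[str],
) -> list[str]:
    """Return the field names that would be sent for human review.

    A field qualifies only when both conditions hold: its name is listed in
    review_fields and its confidence is lower than min_confidence.
    """
    return [
        name
        for name, confidence in confidences.items()
        if name in review_fields and confidence < min_confidence
    ]
```

With confidences of 0.95 for `buyer`, 0.40 for `seller`, and 0.30 for `agreement_date`, a `min_confidence` of 0.8, and `review_fields` set to `["buyer", "seller"]`, only `seller` qualifies: `buyer` is confident enough, and `agreement_date` is not listed for review.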

The input to a `docext` node is expected to be of type `DocExtInput`, from the module `ibm_watsonx_orchestrate.flow_builder.types`.

Example use of the `docext` node in an agentic workflow:

```py Python [expandable] theme={null}
from pydantic import BaseModel, Field
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)
from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigField, DocumentProcessingCommonInput


class Fields(BaseModel):
    """
    Configuration schema for document extraction fields.
    
    Defines the fields to be extracted from contract documents, including
    their names, types, and descriptions. Each field is configured with
    a DocExtConfigField that specifies how the document extractor should
    identify and extract the information.
    
    In this example, we define a number of custom fields for a Contract or 
    Agreement document:
        buyer: The purchasing party in the contract
        seller: The selling party in the contract
        agreement_date: The date when the agreement was signed (date type)
        agreement_number: Unique identifier for the contract
        contract_type: Classification of the contract 
    """
    buyer: DocExtConfigField = Field(
        name="Buyer",
        default=DocExtConfigField(
            name="Buyer",
            field_name="buyer"
        )
    )
    
    seller: DocExtConfigField = Field(
        name="Seller",
        default=DocExtConfigField(
            name="Seller",
            field_name="seller"
        )
    )
    
    agreement_date: DocExtConfigField = Field(
        name="Agreement date",
        default=DocExtConfigField(
            name="Agreement Date",
            field_name="agreement_date",
            type="date"
        )
    )
    
    agreement_number: DocExtConfigField = Field(
        name="Agreement number",
        default=DocExtConfigField(
            name="Agreement Number",
            field_name="agreement_number",
            description="The identifier of this contract."
        )
    )
    
    contract_type: DocExtConfigField = Field(
        name="Contract type",
        default=DocExtConfigField(
            name="Contract Type",
            field_name="contract_type",
            type="string",
            description="The type of contract between the buyer and the seller."
        )
    )


@flow(
    name="custom_flow_docext_example",
    display_name="custom_flow_docext_example",
    description="Extraction of custom fields from a document, specified by the user.",
    input_schema=DocumentProcessingCommonInput
)
def build_docext_flow(aflow: Flow = None) -> Flow:
    # aflow.docext returns 2 objects: the document extractor node and the schema of the extracted values.
    # In this example, doc_ext_node is the node and is added to the flow.
    # _ExtractedValues is the output schema of doc_ext_node and can be used as the input schema of nodes downstream in the flow.

    doc_ext_node, _ExtractedValues = aflow.docext(
        name="contract_extractor",
        display_name="Extract fields from a contract",
        description="Extracts fields from an input contract file",
        llm="groq/openai/gpt-oss-120b",
        fields=Fields()
    )

    aflow.sequence(START, doc_ext_node, END)
    return aflow
```
