Use the document text extractor node to extract text from your documents.
This feature is currently in public preview. Functionality and behavior may change in future updates.

Prerequisites

Run the following command to enable watsonx Orchestrate Developer Edition to process documents:
[BASH]
orchestrate server start -e <.env file path> -d
Note: Document processing features require at least 20 GB of RAM allocated to your Docker engine. Configure this allocation when you install watsonx Orchestrate Developer Edition.
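
The memory requirement in the note can be checked before starting the server. The following is a minimal sketch, not part of the product: it assumes the docker CLI is on your PATH and reads the engine's total memory with `docker info --format '{{.MemTotal}}'`. The helper names (`meets_minimum`, `docker_memory_bytes`) and the 1 GiB tolerance are illustrative assumptions. The tolerance exists because the engine usually reports slightly less memory than the configured allocation.

```python
import subprocess
from typing import Optional

GIB = 1024 ** 3
MINIMUM_GIB = 20       # minimum allocation stated in the note (assumed to mean GiB)
TOLERANCE_GIB = 1.0    # assumption: allow for memory the engine reserves for itself


def meets_minimum(mem_total_bytes: int,
                  minimum_gib: float = MINIMUM_GIB,
                  tolerance_gib: float = TOLERANCE_GIB) -> bool:
    """Return True if the reported memory is within tolerance of the minimum."""
    return mem_total_bytes >= (minimum_gib - tolerance_gib) * GIB


def docker_memory_bytes() -> Optional[int]:
    """Ask the Docker engine for its total memory in bytes.

    Returns None if the docker CLI is missing, the engine is not running,
    or the output cannot be parsed as an integer.
    """
    try:
        result = subprocess.run(
            ["docker", "info", "--format", "{{.MemTotal}}"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    try:
        return int(result.stdout.strip())
    except ValueError:
        return None
```

To use it, call `docker_memory_bytes()` and pass a non-None result to `meets_minimum()`. If the check returns False, raise the memory allocation in your Docker engine settings before running the `orchestrate server start` command.
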

Configuring document processing in flows

In your flow, include a call to the ‘docproc()’ method to process a document. This method accepts the following input arguments:
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique identifier for the node. |
| task | string | Yes | Specifies which information is extracted from the document upon processing. Supported values: text_extraction (extracts plain text from documents). |
| display_name | string | No | Display name for the node. |
| description | string | No | Description of the node. |
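
The parameter rules above can be checked before the node is created. This is a minimal sketch under stated assumptions: `validate_docproc_args` and `SUPPORTED_TASKS` are hypothetical helpers written for illustration and are not part of the ibm_watsonx_orchestrate package. The sketch encodes only what the table states: name and task are required strings, task must be a supported value, and display_name and description are optional strings.

```python
from typing import Optional

# Supported task values as listed in the parameter table.
SUPPORTED_TASKS = {"text_extraction"}


def validate_docproc_args(name: str,
                          task: str,
                          display_name: Optional[str] = None,
                          description: Optional[str] = None) -> dict:
    """Hypothetical helper: check docproc() arguments against the table.

    Returns a dict of keyword arguments containing only the values provided.
    """
    if not isinstance(name, str) or not name.strip():
        raise ValueError("name is required and must be a non-empty string")
    if task not in SUPPORTED_TASKS:
        raise ValueError(
            f"unsupported task {task!r}; expected one of {sorted(SUPPORTED_TASKS)}"
        )

    kwargs = {"name": name, "task": task}
    optional = {"display_name": display_name, "description": description}
    for key, value in optional.items():
        if value is None:
            continue  # optional parameter omitted
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string")
        kwargs[key] = value
    return kwargs
```

The returned dictionary can then be unpacked into the call, for example `aflow.docproc(**validate_docproc_args(name="text_extraction", task="text_extraction"))`.
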
The input to a docproc node is expected to be of type DocExtInput, from the ibm_watsonx_orchestrate.flow_builder.types module. The following example uses a docproc node in a flow:
[PYTHON]
from ibm_watsonx_orchestrate.flow_builder.flows import (
    Flow, flow, START, END
)

from ibm_watsonx_orchestrate.flow_builder.types import DocExtConfigEntity, DocProcInput, File, DocExtInput


@flow(
    name="text_extraction_flow_example",
    display_name="text_extraction_flow_example",
    description="This flow consists of one node: a docproc node, which extracts text from the input document",
    input_schema=DocExtInput
)
def build_docproc_flow(aflow: Flow = None) -> Flow:
    doc_proc_node = aflow.docproc(
        name="text_extraction",
        display_name="text_extraction",
        description="Extract text out of a document's contents.",
        task="text_extraction"
    )

    aflow.sequence(START, doc_proc_node, END)
    return aflow