Version: 1.3.2

Source scanner

The jf-source-scanner is an optional component implemented as a Spring Boot application. It runs in a separate container and polls one or more external data sources to create jobs in the flow controller for processing.

Installation

Standalone package

Like other workers, the source scanner can be started as a separate component from a standalone package. The provided wrapper command files can be used to run the source scanner directly or to install it as a service.

Helm chart

A Helm chart for the jf-source-scanner is available.

Configuration

Overview

The source scanner provides an API to handle polling of an external data source. The following sections describe the configurable scanners. Scanners are configured in the application.yaml file of the jf-source-scanner component. The source scanner also provides a REST API to query its state.

The scanner will:

  • poll for new input data
  • upload the input binary data into the flow storage
  • create jobs in the flow controller
  • check for job completion, either at a regular interval or via the event bus
  • handle the final state (e.g. moving the input data to a FINISHED or ERROR state directory, or setting a specific status in a database via JDBC)
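
The steps above can be illustrated with a small in-memory sketch. It is not the scanner's actual implementation: FakeController, scan_once and check_finished are hypothetical names, the storage is a plain dict, and the state names FINISHED and ERROR are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeController:
    """Hypothetical in-memory stand-in for the flow controller."""
    jobs: dict = field(default_factory=dict)
    next_id: int = 1

    def create_job(self, storage_ref):
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = "RUNNING"
        return job_id

    def state(self, job_id):
        return self.jobs[job_id]


def scan_once(pending_inputs, storage, controller, open_jobs):
    """One polling pass: upload each new input, create a job, remember its ID."""
    for name, data in pending_inputs.items():
        storage[name] = data                          # upload into the (fake) flow storage
        open_jobs[controller.create_job(name)] = name
    pending_inputs.clear()                            # inputs are now accepted


def check_finished(controller, open_jobs, finished, failed):
    """Completion check: move finished or failed jobs out of the open list."""
    for job_id in list(open_jobs):
        state = controller.state(job_id)
        if state == "FINISHED":
            finished.append(open_jobs.pop(job_id))
        elif state == "ERROR":
            failed.append(open_jobs.pop(job_id))
```

With the event bus enabled, the same final-state handling would presumably be triggered per job event instead of by a periodic call to check_finished.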

Configuration

Each scanner has the following base parameters:

Example application.yaml snippet for a file scanner ('scanner1')
scanners:
  # When a job finishes, an event is sent. This can be used to instantly handle the final state (e.g. moving to FINISHED dir)
  # Event handling can also be disabled and the "checkFinishedFeedbackInterval" can be used to check for those jobs regularly.
  # To disable the global thread, set to 0m (if event bus is enabled).
  checkFinishedFeedbackInterval: 1m
  useEventBus: true
  configs:
  - name: scanner1
    scannerClassName: "com.jadice.flow.controller.scanner.file.FileScanner"
    recoverOnStartup: true
    sizeLimits:
      maxItemsPerJob: 100
      maxJobsPerInterval: 100
    pollingInterval:
      pollingInterval: 10
      pollingIntervalUnit: "SECONDS"
    enableBackpressureCheck: false
    backpressureHighWatermark: 100
    backpressureLowWatermark: 10
    backpressureSizeLimits:
      maxItemsPerJob: 10
      maxJobsPerInterval: 1
    backpressurePollingInterval:
      pollingInterval: 5
      pollingIntervalUnit: "MINUTES"
    properties:
      inputDirectoryPath: "${user.dir}/workdir/input"
      stateDirectoryPath: "${user.dir}/workdir/scanner1"
      asyncUploadThreadCount: "5"
      metaFileSuffix: ".xml"
      # The following setting allows to override the job template name from the File JobRequest.
      #overrideJobTemplateName: "Local-Sleep"
      fileReaderClass: "com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader"

The most noteworthy configuration parameters are:

  • checkFinishedFeedbackInterval : By default, the IDs of created jobs are held in a list, and at this interval the scanner checks whether those jobs have finished. If the event bus is used, a listener receives the finished state directly, so the interval can be set to 0m to disable the regular check (leaving it enabled does no harm).
  • useEventBus : If enabled, a listener is registered on the event bus. The controller must also have the event bus enabled so that it sends job events via the event bus. The scanner then handles a finished job immediately instead of processing the created jobs in regular batches.
  • scannerClassName : The name of the scanner class to use.
  • sizeLimits : The desired maximum sizes for jobs. Whether the limits are honored depends on the scanner implementation (e.g. the FileScanner reads the full input file regardless of this setting).
  • pollingInterval : This section defines the polling interval. The possible values for pollingIntervalUnit are the enum constants of java.util.concurrent.TimeUnit, e.g. MINUTES or SECONDS.
  • enableBackpressureCheck : Whether to use the backpressure check. If enabled, the backpressureHighWatermark and backpressureLowWatermark values are used for backpressure handling. When a job is created and its returned queue position is above the backpressureHighWatermark, the scanner switches into RUNNING_BACKPRESSURE mode. In this mode, the separate backpressurePollingInterval section defines a slower polling interval. The scanner returns to RUNNING mode once a created job has a queue position below the backpressureLowWatermark.
  • properties : This section contains settings specific to the scanner implementation. They are described in the following sections together with the corresponding scanner.
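
The switch between RUNNING and RUNNING_BACKPRESSURE described above is a hysteresis between two watermarks. The following is a minimal sketch of that rule, not the scanner's own code: the class BackpressureState and its method names are hypothetical, and the default values are simply copied from the example configuration (10 seconds normal polling, 5 minutes in backpressure mode).

```python
from dataclasses import dataclass


@dataclass
class BackpressureState:
    """Hypothetical model of the mode switching with two watermarks."""
    high_watermark: int = 100      # backpressureHighWatermark
    low_watermark: int = 10        # backpressureLowWatermark
    mode: str = "RUNNING"

    def on_job_created(self, queue_position: int) -> str:
        """Update the mode from the queue position returned for a new job."""
        if self.mode == "RUNNING" and queue_position > self.high_watermark:
            self.mode = "RUNNING_BACKPRESSURE"
        elif self.mode == "RUNNING_BACKPRESSURE" and queue_position < self.low_watermark:
            self.mode = "RUNNING"
        return self.mode

    def polling_interval_seconds(self) -> int:
        """Interval of the active mode, using the values of the example config."""
        return 5 * 60 if self.mode == "RUNNING_BACKPRESSURE" else 10
```

Because the two thresholds differ, a queue position between the watermarks leaves the current mode unchanged, which avoids rapid toggling between the two polling intervals.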

File scanner

The file scanner is a basic file polling scanner. It uses state directories to handle the input data. Input data can be provided in different formats. The default is an XML format described in a later subsection.

In the configured stateDirectoryPath (see example below), the scanner will use the following sub-directories:

  • accepted: Input data for which a job has been created. The scanner moves the data from the input directory to this directory.
  • done: Finished jobs
  • failure: Failed jobs

A job directory follows the naming scheme JobID-TemplateName, e.g. 42-ImageToTiffWithHocr.
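
A script that inspects the state directories may want to split such a directory name back into its parts. The helper below is a sketch with a hypothetical name, parse_job_dir_name. It assumes that the job ID itself contains no hyphen (as in the example), and it splits on the first hyphen only because template names such as "Local-Sleep" may contain hyphens.

```python
def parse_job_dir_name(name: str) -> tuple[str, str]:
    """Split 'JobID-TemplateName' into (job_id, template_name)."""
    job_id, separator, template_name = name.partition("-")
    if not separator or not job_id or not template_name:
        raise ValueError(f"not a JobID-TemplateName directory: {name!r}")
    return job_id, template_name
```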

Workflow / Usage:

Input data should be provided in the inputDirectoryPath inside a folder (one folder per job). This folder should contain the meta file (e.g. "input.xml") and the binary data files (subdirectories are allowed). The "input.xml" file references the input binary files via the "filename" attribute of the parts.
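
A producer could prepare such a job folder as sketched below. The function stage_job_folder is hypothetical. Whether the scanner tolerates a partially written folder is not stated here, so this sketch takes the cautious route: it builds the folder next to the input directory and renames it into place only when it is complete.

```python
import tempfile
from pathlib import Path


def stage_job_folder(input_dir: Path, job_name: str, meta_xml: str,
                     binaries: dict[str, bytes]) -> Path:
    """Create one job folder (meta file plus binary files) inside input_dir."""
    input_dir.mkdir(parents=True, exist_ok=True)
    # Stage on the same file system so that the final rename is a single step.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=input_dir.parent))
    (staging / "input.xml").write_text(meta_xml, encoding="utf-8")
    for relative_name, data in binaries.items():
        target = staging / relative_name          # path relative to input.xml
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    final = input_dir / job_name
    staging.rename(final)                         # publish the complete folder
    return final
```

The keys of binaries are the same relative paths that the meta file uses in its "filename" entries, e.g. "3-pages.pdf" or "scans/5-pages.pdf".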

File scanner configuration:

Example application.yaml snippet for a file scanner ('scanner1')
  - name: scanner1
    # ... Default scanner parameters omitted in this example (see above example for those)
    properties:
      inputDirectoryPath: "${user.dir}/workdir/input"
      stateDirectoryPath: "${user.dir}/workdir/scanner1"
      asyncUploadThreadCount: "5"
      metaFileSuffix: ".xml"
      # The following setting allows to override the job template name from the File JobRequest.
      #overrideJobTemplateName: "Local-Sleep"
      fileReaderClass: "com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader"

The most noteworthy configuration parameters are:

  • inputDirectoryPath : The directory in which the scanner looks for input data.
  • stateDirectoryPath : The base path of the state directories. It is recommended to use a separate state directory for each scanner if recoverOnStartup is enabled.
  • asyncUploadThreadCount : The scanner uploads the input data into the flow storage system. This value sets the number of concurrent upload threads.
  • metaFileSuffix : The file suffix of the meta file, e.g. ".xml"
  • overrideJobTemplateName (optional) : Useful in testing scenarios. The jobTemplateName is normally taken from the XML file / JobRequest. This setting overrides that name (e.g. to process the same input data with another template).
  • fileReaderClass : The file reader implementation that is used to read the meta file, e.g. com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader
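
To illustrate what a concurrent upload thread count means in practice, here is a generic thread pool sketch. How the scanner schedules its uploads internally is not described here. The function upload_all and its upload callback are hypothetical, and the default of 5 threads mirrors the example value of asyncUploadThreadCount.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(filenames, upload, thread_count=5):
    """Upload all files with at most thread_count uploads running at once.

    upload is a caller-supplied function that takes a filename and returns
    a storage reference. Returns a dict mapping filename to that reference.
    """
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # pool.map keeps the input order, so zip pairs names with results.
        return dict(zip(filenames, pool.map(upload, filenames)))
```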

Default XML Format

The default format is a simple XML document: a JobRequest (as used by the flow client or the REST API) serialized to a file in XML format using the ObjectMapper.

Example input.xml
<JobRequest>
  <id>42f997fc-d085-4d9b-968b-3d73370412a3</id>
  <jobTemplateName>toTIFF</jobTemplateName>
  <creatorName />
  <jobParameterMap />
  <priority>0</priority>
  <items>
    <items>
      <processingProperties />
      <indexData />
      <parts>
        <parts>
          <url />
          <filename>3-pages.pdf</filename>
          <type>BASE_PART</type>
          <mimeType>application/pdf</mimeType>
          <processingProperties />
        </parts>

        <parts>
          <url />
          <filename>5-pages.pdf</filename>
          <type>BASE_PART</type>
          <mimeType>application/pdf</mimeType>
          <processingProperties />
        </parts>
      </parts>
    </items>
  </items>
</JobRequest>

This is an example request for one item with two files, "3-pages.pdf" and "5-pages.pdf", to be processed by the "toTIFF" job template into one TIFF.

Most noteworthy:

  • The JobRequest can have multiple items with multiple parts.
  • The filename must be a path relative to the XML file so that the scanner can upload the file into the storage.
  • If a url is already provided for a part, the scanner does not upload that part.
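
The rules above can be checked against a meta file before it is handed to the scanner. The sketch below uses Python's xml.etree.ElementTree to list the files that would need an upload, i.e. parts with a filename and an empty url. The function name parts_to_upload is hypothetical, and the element paths follow the nested items/items and parts/parts layout of the example file.

```python
import xml.etree.ElementTree as ET


def parts_to_upload(xml_text: str) -> list[str]:
    """Return the filenames of all parts that carry no url yet."""
    root = ET.fromstring(xml_text)
    filenames = []
    # Items and parts are each wrapped in a container element of the same name.
    for part in root.findall("./items/items/parts/parts"):
        url = (part.findtext("url") or "").strip()
        filename = (part.findtext("filename") or "").strip()
        if filename and not url:
            filenames.append(filename)
    return filenames
```

For the example input.xml, this returns both PDF filenames because both url elements are empty.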

REST API

The source scanner provides a basic REST API to check the status and to enable or disable individual scanners.

Endpoints are:

  • GET /status/SCANNER_NAME : Retrieves the status of the scanner named SCANNER_NAME
  • GET /start/SCANNER_NAME : Starts (enables) the scanner named SCANNER_NAME
  • GET /stop/SCANNER_NAME : Stops (disables) the scanner named SCANNER_NAME
  • GET /status-all : Retrieves the status of all running scanners

Example status result:

Example result for /status-all endpoint
<Map><scanner1><state>RUNNING</state><statusMessage/><flowServerUrl/></scanner1></Map>

In this result, scanner1 has the state RUNNING. Other possible states are RUNNING_BACKPRESSURE, STOPPED and FAILURE.
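
A monitoring script could evaluate such a result as sketched below. The function names are hypothetical, the base URL (host and port) depends on the deployment and is an assumption of the caller, and the parsing assumes that the endpoint returns XML shaped like the example above, with one element per scanner name.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_status_all(xml_text: str) -> dict[str, str]:
    """Map each scanner name to its state, e.g. {'scanner1': 'RUNNING'}."""
    root = ET.fromstring(xml_text)
    # Each child element of <Map> is named after a scanner.
    return {scanner.tag: scanner.findtext("state") for scanner in root}


def fetch_status_all(base_url: str) -> dict[str, str]:
    """Call GET /status-all on the given base URL and parse the result."""
    with urllib.request.urlopen(f"{base_url}/status-all") as response:
        return parse_status_all(response.read().decode("utf-8"))
```

A caller might, for example, raise an alert for every scanner whose state is FAILURE.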