Source scanner
The jf-source-scanner is an optional component implemented as a Spring Boot application. It runs in a separate container and polls one or more external data sources to create jobs in the flow controller for processing.
Installation
Standalone package
Like other workers, the source scanner can be started as a separate component from a standalone package. The provided wrapper command files allow it to be run directly or installed as a service.
Helm chart
A Helm chart for the jf-source-scanner is available.
Configuration
Overview
The source scanner provides an API to handle the polling on an external data source. The following sections describe the configurable scanners.
Scanners are configured in the application.yaml file of the jf-source-scanner component. The scanner also provides a REST API to query its state.
The scanner will:
- poll for new input data
- upload the input binary data into the flow storage
- create jobs in the flow controller
- check regularly (or based on event bus) for job completion
- handle the final state (e.g. moving it to FINISHED or ERROR state directory or setting a specific status via JDBC in a database)
Configuration
Each scanner has the following base parameters:
Example application.yaml snippet for a file scanner ('scanner1')
scanners:
  # When a job finishes, an event is sent. This can be used to instantly handle the final state (e.g. moving to FINISHED dir)
  # Event handling can also be disabled and the "checkFinishedFeedbackInterval" can be used to check for those jobs regularly.
  # To disable the global thread, set to 0m (if event bus is enabled).
  checkFinishedFeedbackInterval: 1m
  useEventBus: true
  configs:
    - name: scanner1
      scannerClassName: "com.jadice.flow.controller.scanner.file.FileScanner"
      recoverOnStartup: true
      sizeLimits:
        maxItemsPerJob: 100
        maxJobsPerInterval: 100
      pollingInterval:
        pollingInterval: 10
        pollingIntervalUnit: "SECONDS"
      enableBackpressureCheck: false
      backpressureHighWatermark: 100
      backpressureLowWatermark: 10
      backpressureSizeLimits:
        maxItemsPerJob: 10
        maxJobsPerInterval: 1
      backpressurePollingInterval:
        pollingInterval: 5
        pollingIntervalUnit: "MINUTES"
      properties:
        inputDirectoryPath: "${user.dir}/workdir/input"
        stateDirectoryPath: "${user.dir}/workdir/scanner1"
        asyncUploadThreadCount: "5"
        metaFileSuffix: ".xml"
        # The following setting allows to override the job template name from the File JobRequest.
        #overrideJobTemplateName: "Local-Sleep"
        fileReaderClass: "com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader"
The most noteworthy configuration parameters are:
checkFinishedFeedbackInterval
: By default, the IDs of created jobs are held in a list, and the scanner checks at this interval whether those jobs have finished. The interval can be set to 0m if the event bus is used: the finished state is then received by a listener on the event bus, so the regular check can be disabled (leaving it enabled does no harm).
useEventBus
: If enabled, a listener is registered on the event bus. The controller must also have the event bus enabled so that it sends job events via the event bus. The scanner then handles a finished job instantly instead of checking the created jobs regularly in batches.
scannerClassName
: The name of the scanner class to use.
sizeLimits
: Implementation specific: the desired maximum sizes for jobs. Whether the limit can be honored depends on the scanner implementation (e.g. the FileScanner reads the full input file regardless of the setting).
pollingInterval
: This section defines the polling interval. The possible values for pollingIntervalUnit are the enum values of java.util.concurrent.TimeUnit, e.g. MINUTES, SECONDS.
enableBackpressureCheck
: Whether to use the backpressure check. If enabled, the backpressureHighWatermark and backpressureLowWatermark values are used for backpressure handling. When a job is created and its returned queue position is above the backpressureHighWatermark, the scanner switches into the RUNNING_BACKPRESSURE mode. This mode has the separate backpressurePollingInterval section to define a slower polling interval in backpressure mode. The scanner returns to RUNNING mode when a created job has a queue position below the backpressureLowWatermark.
properties
: This section contains the settings specific to the scanner implementation. Those settings are described in the following chapters together with the corresponding scanner.
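The switch between RUNNING and RUNNING_BACKPRESSURE described above behaves like a two-threshold (hysteresis) check. The following is a minimal sketch of that idea, not the scanner's actual code: the class BackpressureSketch and its methods are hypothetical names chosen for illustration, and only the parameter names in the comments come from the configuration. It also shows how a pollingInterval / pollingIntervalUnit pair could be converted with java.util.concurrent.TimeUnit.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; class and method names are hypothetical
// and not part of the jf-source-scanner code base.
class BackpressureSketch {
    enum Mode { RUNNING, RUNNING_BACKPRESSURE }

    private final int highWatermark; // corresponds to backpressureHighWatermark
    private final int lowWatermark;  // corresponds to backpressureLowWatermark
    private Mode mode = Mode.RUNNING;

    BackpressureSketch(int highWatermark, int lowWatermark) {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Updates the mode from the queue position returned for a newly created job. */
    Mode onJobCreated(int queuePosition) {
        if (mode == Mode.RUNNING && queuePosition > highWatermark) {
            mode = Mode.RUNNING_BACKPRESSURE;
        } else if (mode == Mode.RUNNING_BACKPRESSURE && queuePosition < lowWatermark) {
            mode = Mode.RUNNING;
        }
        // Positions between the two watermarks leave the mode unchanged.
        return mode;
    }

    /** Converts a pollingInterval / pollingIntervalUnit pair into milliseconds. */
    static long toMillis(long pollingInterval, String pollingIntervalUnit) {
        return TimeUnit.valueOf(pollingIntervalUnit).toMillis(pollingInterval);
    }

    public static void main(String[] args) {
        BackpressureSketch sketch = new BackpressureSketch(100, 10);
        for (int position : new int[] {50, 101, 50, 9}) {
            System.out.println(position + " -> " + sketch.onJobCreated(position));
        }
        System.out.println(toMillis(10, "SECONDS")); // regular interval from the example
        System.out.println(toMillis(5, "MINUTES"));  // backpressure interval from the example
    }
}
```

With the watermarks from the example configuration (100 and 10), a queue position of 50 does not change the mode in either direction; only crossing the high or low watermark does.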
File scanner
The file scanner is a basic file polling scanner. It uses state directories to handle the input data. Input data can be provided in different formats; the default is an XML format described in a later subsection.
In the configured stateDirectoryPath (see example below), the scanner will use the following sub-directories:
accepted
: Input data for which a job has been created. The scanner moves data from the input directory to this directory.
done
: Finished jobs
failure
: Failed jobs
A job directory follows the naming scheme JobID-TemplateName, e.g. 42-ImageToTiffWithHocr.
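Tooling that inspects the state directories may need to split such a directory name back into its two parts. The helper below is a small sketch, not part of the scanner: the class JobDirectoryName is a hypothetical name, and it assumes that the job ID itself contains no hyphen while the template name may (as in "Local-Sleep"), so it splits at the first hyphen. It uses a record and therefore needs Java 16 or later.

```java
// Illustrative helper; the class name is hypothetical. Assumption: the job ID
// contains no hyphen, so the first hyphen separates job ID and template name.
class JobDirectoryName {
    record Parsed(String jobId, String templateName) {}

    /** Builds a directory name following the JobID-TemplateName scheme. */
    static String format(String jobId, String templateName) {
        return jobId + "-" + templateName;
    }

    /** Splits a directory name at the first hyphen. */
    static Parsed parse(String directoryName) {
        int separator = directoryName.indexOf('-');
        if (separator <= 0 || separator == directoryName.length() - 1) {
            throw new IllegalArgumentException(
                "Not a JobID-TemplateName directory: " + directoryName);
        }
        return new Parsed(
            directoryName.substring(0, separator),
            directoryName.substring(separator + 1));
    }

    public static void main(String[] args) {
        System.out.println(parse("42-ImageToTiffWithHocr"));
        System.out.println(parse("7-Local-Sleep")); // template name keeps its hyphen
    }
}
```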
Workflow / Usage:
Input data should be provided in the inputDirectoryPath, with one folder per job. Each folder should contain the meta file (e.g. "input.xml") and the binary data files; sub-directories are allowed.
The "input.xml" file references the input binary files via the "filename" element of each part.
File scanner configuration:
Example application.yaml snippet for a file scanner ('scanner1')
- name: scanner1
  # ... Default scanner parameters omitted in this example (see the example above for those)
  properties:
    inputDirectoryPath: "${user.dir}/workdir/input"
    stateDirectoryPath: "${user.dir}/workdir/scanner1"
    asyncUploadThreadCount: "5"
    metaFileSuffix: ".xml"
    # The following setting allows to override the job template name from the File JobRequest.
    #overrideJobTemplateName: "Local-Sleep"
    fileReaderClass: "com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader"
The most noteworthy configuration parameters are:
inputDirectoryPath
: The input directory path.
stateDirectoryPath
: The state directory base path. It is recommended to use a separate state directory for each scanner if recoverOnStartup is enabled.
asyncUploadThreadCount
: The scanner uploads the input data into the flow storage system. This value is the number of concurrent upload threads.
metaFileSuffix
: The file suffix of the meta file, e.g. ".xml".
overrideJobTemplateName
: (optional) Useful in testing scenarios. The jobTemplateName is given in the XML file / JobRequest. This setting overrides the given name (e.g. to use the input data with another template).
fileReaderClass
: The file reader implementation that is able to read the meta file, e.g. com.jadice.flow.controller.scanner.file.DefaultXmlFormatReader
Default XML Format
The default XML format is a simple XML document. Generally, it is a JobRequest (as used by the flow client or the REST API) serialized to a file in XML format using the ObjectMapper.
Example input.xml
<JobRequest>
  <id>42f997fc-d085-4d9b-968b-3d73370412a3</id>
  <jobTemplateName>toTIFF</jobTemplateName>
  <creatorName />
  <jobParameterMap />
  <priority>0</priority>
  <items>
    <items>
      <processingProperties />
      <indexData />
      <parts>
        <parts>
          <url />
          <filename>3-pages.pdf</filename>
          <type>BASE_PART</type>
          <mimeType>application/pdf</mimeType>
          <processingProperties />
        </parts>
        <parts>
          <url />
          <filename>5-pages.pdf</filename>
          <type>BASE_PART</type>
          <mimeType>application/pdf</mimeType>
          <processingProperties />
        </parts>
      </parts>
    </items>
  </items>
</JobRequest>
This is an example request for one item with two files, "3-pages.pdf" and "5-pages.pdf", to be processed by the "toTIFF" job template into one TIFF.
Most noteworthy:
- The JobRequest can have multiple items with multiple parts.
- The filename must be set to a path relative to the XML file so that the scanner can upload the file into the storage.
- If an optional url is already provided, the scanner performs no upload.
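Because each filename is resolved relative to the XML file, it can be useful to check a job folder before placing it in the input directory. The following is a hedged sketch of such a pre-flight check, not something the scanner provides: the class InputXmlCheck and both of its methods are hypothetical, and the sketch relies only on the element names shown in the example above (parts, filename, url). Parts that already carry a url are skipped, since no upload takes place for them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical pre-flight check; not part of the jf-source-scanner.
class InputXmlCheck {

    /** Lists referenced files that do not exist relative to the meta file. */
    static List<Path> missingFiles(Path metaFile) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(metaFile.toFile());
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read " + metaFile, e);
        }
        Path baseDir = metaFile.toAbsolutePath().getParent();
        List<Path> missing = new ArrayList<>();
        NodeList filenames = doc.getElementsByTagName("filename");
        for (int i = 0; i < filenames.getLength(); i++) {
            Element filename = (Element) filenames.item(i);
            Element part = (Element) filename.getParentNode();
            NodeList urls = part.getElementsByTagName("url");
            boolean hasUrl = urls.getLength() > 0
                && !urls.item(0).getTextContent().isBlank();
            if (hasUrl) {
                continue; // a url is already provided, so no local file is needed
            }
            Path file = baseDir.resolve(filename.getTextContent().trim());
            if (!Files.isRegularFile(file)) {
                missing.add(file);
            }
        }
        return missing;
    }

    /** Writes a demo job folder in which one of two referenced files is missing. */
    static Path writeDemoJob() {
        try {
            Path jobDir = Files.createTempDirectory("job");
            Files.write(jobDir.resolve("3-pages.pdf"), new byte[] {1, 2, 3});
            String xml = "<JobRequest><jobTemplateName>toTIFF</jobTemplateName>"
                + "<items><items><parts>"
                + "<parts><url /><filename>3-pages.pdf</filename></parts>"
                + "<parts><url /><filename>5-pages.pdf</filename></parts>"
                + "</parts></items></items></JobRequest>";
            return Files.writeString(jobDir.resolve("input.xml"), xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Missing: " + missingFiles(writeDemoJob()));
    }
}
```

In the demo folder only "3-pages.pdf" exists, so the check reports the path of "5-pages.pdf" as missing.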
REST API
The source scanner provides a basic REST API to check the status of scanners and to enable or disable individual scanners.
Endpoints are:
- GET /status/SCANNER_NAME: Retrieves the status of a single scanner, where SCANNER_NAME is the name of the desired scanner
- GET /start/SCANNER_NAME: Enables a scanner, where SCANNER_NAME is the name of the desired scanner
- GET /stop/SCANNER_NAME: Disables a scanner, where SCANNER_NAME is the name of the desired scanner
- GET /status-all: Retrieves the status for all running scanners
Example status result:
Example result for /status-all endpoint
<Map><scanner1><state>RUNNING</state><statusMessage/><flowServerUrl/></scanner1></Map>
In this result, the scanner1 has the state RUNNING. Other possible states are RUNNING_BACKPRESSURE, STOPPED and FAILURE.
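The endpoints can be called with any HTTP client. The sketch below uses the JDK's java.net.http.HttpClient (Java 11 or later). It is an illustration under assumptions: the class ScannerApiClient is a hypothetical name, and the base address http://localhost:8080 is a placeholder, since the host and port depend on the deployment.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client sketch; class name and base URL are assumptions.
class ScannerApiClient {

    /** Builds the URI for /status, /start or /stop of a single scanner. */
    static URI endpoint(String baseUrl, String action, String scannerName) {
        // URLEncoder produces form encoding, so "+" is mapped back to "%20" for a path segment.
        String encoded = URLEncoder.encode(scannerName, StandardCharsets.UTF_8)
            .replace("+", "%20");
        return URI.create(baseUrl + "/" + action + "/" + encoded);
    }

    /** Sends a GET request and returns the HTTP status code followed by the body. */
    static String get(URI uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + " " + response.body();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder address; pass the real base URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8080";
        System.out.println(get(endpoint(baseUrl, "status", "scanner1")));
        System.out.println(get(URI.create(baseUrl + "/status-all")));
    }
}
```

Running the main method requires a reachable source scanner; the endpoint helper can be used on its own to build the /start and /stop URIs in the same way.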