Local Docker Compose Deployment for OCR

In this tutorial, we will run a jadice flow OCR worker in a local Docker environment via docker compose.

For example, this setup can be used to handle ad-hoc OCR requests from a jadice web toolkit integration (see jadice web toolkit OCR Addon).

Prerequisites

A Docker installation. Packages and installation instructions for the various operating systems can be found at docker.com.

We use the docker compose command to run the services later on.

To use the jadice flow components, a security token is needed (JADICE-FLOW-ACCESS-TOKEN). You should get a flow access token together with your license. For testing purposes, a test license can also be obtained by sending a request to jadice-support@levigo.de.

Check that you can access https://artifacts.jadice.com/ and https://registry.jadice.com/, because the images are pulled from there. The start-compose scripts we use later perform a docker login to the jadice registry.

How to complete this tutorial

You can start from scratch and create the complete configuration, or you can get the full example code and apply only the necessary changes. A Docker Compose deployment is useful for development or quick test setups. For production use, a container management system such as Kubernetes is strongly recommended.

To start from scratch, move on to Configuration.

To skip the basics, do the following:

  • Download and unzip the source repository for this tutorial, or clone it using Git:
    git clone https://github.com/levigo/jadice-flow-getting-started.git
  • cd into jadice-flow-getting-started/jadice-flow-tutorial-01/docker-compose/local-ocr/
  • In the file .env, replace JADICE-FLOW-ACCESS-TOKEN with your access token.
  • Jump ahead to Startup.

Configuration

In the following steps we create a configuration to provide the required services:

  1. jadice flow controller (jf-controller)
  2. Storage (eureka): The storage is used to store the input images for the OCR operations as well as their results.
  3. A worker configuration, here: jadice flow OCR Worker (jadice-flow-worker-tessocr)

Depending on the workers required for the specific tasks, more services may be required.

Creating the configuration folder

It is good practice to start by creating a directory where the configuration files will be placed. For example, on Windows the path could be C:\Docker-Compose\JadiceFlow\local-ocr.

Common practice is to create a specific folder for each service within the main configuration folder. For example, controller-config/application.yml provides the configuration for the jadice flow controller. In the docker-compose.yml, this configuration folder is mounted into the container jf-controller as a docker volume.

In general, all services in docker-compose.yml can be configured in this manner, as we show in the subsequent paragraphs.
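The folder layout described above can be created in one step. The following is a minimal sketch, assuming the folder names that the docker-compose.yml of this tutorial mounts later on (controller-config, worker-config, eureka-config, eureka-data); the function name create_config_folders is only chosen for illustration.

```shell
# Sketch: create one sub-folder per service inside the configuration root.
# The folder names match the volume mounts used later in this tutorial.
create_config_folders() {
  local root="${1:-.}"
  mkdir -p "${root}/controller-config" \
           "${root}/worker-config" \
           "${root}/eureka-config" \
           "${root}/eureka-data"
}

# Usage: create_config_folders /path/to/local-ocr
```

Because mkdir -p is used, calling the function again on an existing folder structure does no harm.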

Creating a start script

Create a .env file with the required environment variables and then start the services by calling docker compose up.

In this example, we must set the following variables:

  • JF_CONTAINER_REGISTRY_JADICE - Path to the jadice container registry (registry.jadice.com)
  • JF_ACCESS_TOKEN - The access token used by the controller service to authenticate against the worker
  • EUREKA_USERNAME - The username for the eureka storage
  • EUREKA_PASSWORD - The password for the eureka storage
  • COMPOSE_CONVERT_WINDOWS_PATHS - Enables the conversion of Windows-style host paths in volume definitions when running with Docker Desktop on Microsoft Windows

Create the file .env

.env
# container registries
JF_CONTAINER_REGISTRY_JADICE=registry.jadice.com/

# controller
JF_ACCESS_TOKEN=THE-[JADICE-FLOW-ACCESS-TOKEN]

# storage: eureka
EUREKA_USERNAME=user
EUREKA_PASSWORD=password

# for running with Docker Desktop on Microsoft Windows
COMPOSE_CONVERT_WINDOWS_PATHS=1

Replace [JADICE-FLOW-ACCESS-TOKEN] with your access token.
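A forgotten placeholder or a missing variable usually only shows up when the containers start. The following sketch checks the .env file beforehand. The function name check_env_file is hypothetical, and the check assumes the simple NAME=value form used in this tutorial (no quoting, no spaces around the equals sign).

```shell
# Sketch: verify that a .env file defines the variables this tutorial relies on
# and that the access token placeholder has been replaced.
check_env_file() {
  local env_file="${1:-.env}"
  local var status=0
  if [ ! -f "${env_file}" ]; then
    echo "missing file: ${env_file}" >&2
    return 1
  fi
  for var in JF_CONTAINER_REGISTRY_JADICE JF_ACCESS_TOKEN EUREKA_USERNAME EUREKA_PASSWORD; do
    # A variable counts as set if a line starts with NAME= followed by at least one character.
    if ! grep -Eq "^${var}=.+" "${env_file}"; then
      echo "not set: ${var}" >&2
      status=1
    fi
  done
  # The literal placeholder from the template must not remain in the file.
  if grep -Fq '[JADICE-FLOW-ACCESS-TOKEN]' "${env_file}"; then
    echo "placeholder [JADICE-FLOW-ACCESS-TOKEN] has not been replaced" >&2
    status=1
  fi
  return "${status}"
}

# Usage: check_env_file .env && echo ".env looks complete"
```

The function only reports problems and returns a non-zero status; it never modifies the file.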

Finally, we can start the services by executing a start script. During execution, a login to the jadice container registry may be required.

Sample start scripts for Windows and Linux follow.

Windows

Create the file start-compose.cmd:

start-compose.cmd
@echo off
echo Starting docker compose for jadice flow with OCR
echo Login to levigo container registry
docker login registry.jadice.com

IF NOT EXIST eureka-data mkdir eureka-data

docker compose --env-file .env up

Linux

Create the file start-compose.sh:

start-compose.sh
#!/usr/bin/env bash

# Abort on errors, on unset variables and on failures inside pipelines.
set -euo pipefail ;

_login_docker_registry(){
  echo ">>>[start-compose] login to levigo container registry" ;
  docker login registry.jadice.com ;
  return 0 ;
} ;

_configure_container_mounts(){
  echo ">>>[start-compose] configure container mounts" ;
  # Use sudo only if the script is not already running as root.
  local _sudo="" ;
  local _uid="$(id -u)" ;
  if [[ ! "${_uid}" == "0" ]] ; then
    _sudo="sudo"
  fi ;
  ${_sudo} mkdir -p ./eureka-data/ ;
  # 538446 is the group id the controller and worker containers run with.
  ${_sudo} chown -R ${_uid}:538446 ./controller-config/ ;
  ${_sudo} chown -R ${_uid}:538446 ./worker-config/ ;
  ${_sudo} chown -R ${_uid}:0 ./eureka-config/ ;
  ${_sudo} chown -R ${_uid}:0 ./eureka-data/ ;

  return 0 ;
} ;

_start_docker_compose_stack(){
  echo ">>>[start-compose] start docker-compose stack" ;
  echo ">>>[start-compose] you can follow the logs with 'docker compose logs -f'" ;
  docker compose up -d ;
  return 0 ;
} ;


_main() {
  _login_docker_registry ;
  _configure_container_mounts ;
  _start_docker_compose_stack ;
  return 0 ;
}

_main ;

Creating the docker-compose.yml

The docker-compose.yml is the main Docker Compose configuration file. Create this file in the configuration root folder.

Add the following services to this file:

  • jf-controller - the jadice flow main service
  • jadice-flow-worker-tessocr - a worker is required as well; in this tutorial, the OCR worker is used
  • eureka - the storage for the input images and the OCR results

The service names in the docker-compose.yml also serve as host names for network communication between the containers.

Example docker-compose.yml:

docker-compose.yml
---
version: "2.4"

networks:
  jadice-flow-network:
    driver: bridge

services:

  jf-controller:
    mem_limit: "4294967296"
    mem_reservation: 2147483648
    image: "${JF_CONTAINER_REGISTRY_JADICE}jadice-flow-controller:0.26.5"
    user: '538446:538446'
    networks:
      - jadice-flow-network
    restart: always
    environment:
      JF_ACCESS_TOKEN: ${JF_ACCESS_TOKEN}
      EUREKA_ENDPOINT: http://eureka:8080
      EUREKA_USERNAME: "${EUREKA_USERNAME}"
      EUREKA_PASSWORD: "${EUREKA_PASSWORD}"
    volumes:
      - ./controller-config:/app/config
    ports:
      - "8080:8080"

  jadice-flow-worker-tessocr:
    mem_limit: "8589934592"
    mem_reservation: "4294967296"
    image: "${JF_CONTAINER_REGISTRY_JADICE}jf-worker-tessocr:1.8.0"
    networks:
      - jadice-flow-network
    restart: always
    user: '538446:538446'
    environment:
      EUREKA_ENDPOINT: http://eureka:8080
      EUREKA_USERNAME: "${EUREKA_USERNAME}"
      EUREKA_PASSWORD: "${EUREKA_PASSWORD}"
    volumes:
      - ./worker-config:/app/config
    ports:
      - "7081:8080"

  eureka:
    mem_limit: "4294967296"
    mem_reservation: "2147483648"
    image: "${JF_CONTAINER_REGISTRY_JADICE}neverpile-eureka-boxed:0.2.7"
    restart: always
    volumes:
      - ./eureka-config:/config
      - ./eureka-data:/data/neverpile-eureka_default
    environment:
      EUREKA_USERNAME: "${EUREKA_USERNAME}"
      EUREKA_PASSWORD: "${EUREKA_PASSWORD}"
    ports:
      - "8085:8080"
    networks:
      - jadice-flow-network
...


JF_CONTAINER_REGISTRY_JADICE is the container registry from which the jadice flow container images are pulled. This can be the levigo registry registry.jadice.com or a proxy of it.

The containers are attached to the jadice-flow-network. The main configuration parameters are taken from the environment variables defined in the .env file.
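Since the docker-compose.yml refers to these variables as ${NAME}, a variable that is missing from the .env file typically ends up as an empty value in the container. The following sketch lists such variables before the stack is started. The function name list_undefined_compose_vars is hypothetical, and only plain ${NAME} references and NAME=value lines, as used in this tutorial, are recognized.

```shell
# Sketch: print every ${NAME} referenced in a compose file that has no
# NAME=value line in the .env file. Prints nothing if all variables are defined.
list_undefined_compose_vars() {
  local compose_file="${1:-docker-compose.yml}"
  local env_file="${2:-.env}"
  local var
  # Extract the plain ${NAME} references, strip "${" and "}", and de-duplicate.
  grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*\}' "${compose_file}" | tr -d '${}' | sort -u |
    while read -r var; do
      grep -Eq "^${var}=" "${env_file}" || echo "${var}"
    done || true
}

# Usage: list_undefined_compose_vars docker-compose.yml .env
```

With Docker available, docker compose config prints the compose file with all variables resolved, which is a useful second check.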

Now we can add further configuration files for our services. As seen in the docker-compose.yml, the relative host paths are mounted directly into the container's /config or /app/config directory.

jadice flow controller configuration

(Service name jf-controller in docker-compose.yml)

Create the following files:

  • controller-config/application.yml
  • controller-config/jobtemplates.yml
  • controller-config/workers.yml

These files contain the configuration of this jadice flow installation. They look like this:

controller-config/application.yml
---
server:
  port: 8080

spring:
  config:
    ## worker and job configuration files:
    import: "/app/config/jobtemplates.yml,/app/config/workers.yml"
  datasource:
    url: "jdbc:h2:mem:jadice-flow-db;INIT=CREATE SCHEMA IF NOT EXISTS JADICE_FLOW"
    username: jadice-flow-controller
    password: changemeorkeepmeidontcare

# H2-Console
h2-console-config:
  enabled: true
  port: 8082

# Storage
publisher:
  # required
  internalEndpoint: "http://eureka:8080"
  eureka:
    endpoint: ${EUREKA_ENDPOINT}
    username: ${EUREKA_USERNAME}
    password: ${EUREKA_PASSWORD}

# Jadice flow main config
jadice-flow:
  server-url: http://localhost:8080/
  securityToken: ${JF_ACCESS_TOKEN}
  system:
    lockJobConfiguration: false
    configFileJobs: /app/config/jobtemplates.yml

jadice:
  license-configuration:
    license: |
      ----BEGIN LICENSE----
      abcdefghijklmnopqrstuvwxyz
      ----END LICENSE----
    fingerprint: 1234567890
    public-key: |
      -----BEGIN PUBLIC KEY-----
      abcdefghijklmnopqrstuvwxyz
      -----END PUBLIC KEY-----
...

Note the following settings:

  • jadice-flow.securityToken - The access token required to access the flow workers.
  • jadice.license-configuration - The jadice license required to start up the controller and run jobs.

The jadice flow controller uses an in-memory H2 database as its runtime database. Other runtime databases can be configured via spring.datasource in the application.yml.

The jobtemplates.yml contains the definitions of the workflows. In this tutorial, there is only one simple job template with a single step: OCR.

controller-config/jobtemplates.yml
jadice-flow.jobs:
  jobTemplates:
  - jobName: "ocr"
    description: "Performs optical character recognition for the given input image(s). Default output is one plain text part and one HOCR part."
    properties: {}
    enabled: true
    stepTemplates:
    - stepName: "OCR"
      workerDefinitionName: "TessOCR"
      inputMimeTypes:
      - "image/png"
      - "application/pdf"
      - "application/octet-stream"
      - "image/jpeg"
      - "image/tiff"
      - "image/bmp"
      - "image/gif"
      expectsNewPartResult: true
      markSrcAsMetaOnResult: true
      parameters:
      - name: "output-formats"
        type: "com.jadice.flow.worker.ocr.OCROutputSetting"
        subTypes: []
        value: "\"TEXT_AND_HOCR\""
        description: "OCR output format(s)"
    jobFlow:
    - from: ""
      "on": "*"
      to: "OCR"

The controller-config/workers.yml contains the definitions of the workers. In our case there is only the worker "TessOCR".

controller-config/workers.yml
jadice-flow.workers:
  workerDefinitions:
  - workerName: "TessOCR"
    description: "Performs optical character recognition on the given image parts and stores the result as new part"
    processorClass: "com.jadice.flow.controller.server.processor.impl.TessOCRProcessor"
    workerURL: "http://jadice-flow-worker-tessocr:8080/"
    infoTags:
    - "PART_BASED"
    - "IMAGE_PROCESSING"
    - "REMOTE"
    workerParameters:
    - name: "output-formats"
      type: "com.jadice.flow.worker.ocr.OCROutputSetting"
      subTypes: null
      value: "\"TEXT\""
      description: "OCR output format(s)"
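The step in jobtemplates.yml refers to its worker by name (workerDefinitionName: "TessOCR"), and this name has to match a workerName in workers.yml. A simple text-based cross-check could look like the following sketch. The function name list_unknown_workers is hypothetical, and the check assumes that the names are written in double quotes on a single line, as in the two files above.

```shell
# Sketch: print every workerDefinitionName used in the job templates that has
# no matching workerName in the worker definitions. Prints nothing if all match.
list_unknown_workers() {
  local jobs_file="${1:-controller-config/jobtemplates.yml}"
  local workers_file="${2:-controller-config/workers.yml}"
  local name
  # Pull the quoted value out of each workerDefinitionName line and de-duplicate.
  sed -n 's/.*workerDefinitionName:[[:space:]]*"\([^"]*\)".*/\1/p' "${jobs_file}" | sort -u |
    while read -r name; do
      grep -Fq "workerName: \"${name}\"" "${workers_file}" || echo "${name}"
    done
}

# Usage: list_unknown_workers controller-config/jobtemplates.yml controller-config/workers.yml
```

This is plain text matching and not a YAML parser, so it only suits files that follow the formatting of this tutorial.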

Storage Configuration

(Service name eureka in docker-compose.yml)

The eureka storage only requires a username and a password. The values are set through Docker Compose, but we need to add the placeholders to its application.yml.

Create the following file: eureka-config/application.yml

eureka-config/application.yml
---
server:
  port: 8080
spring:
  application:
    name: neverpile eureka
  security:
    user:
      name: "${EUREKA_USERNAME}"
      password: "${EUREKA_PASSWORD}"
...

Note

The stored files are not cleaned up by the jadice flow components. It is the responsibility of the integrating application to clean them up.

Usually, the OCR data is not needed for a long time. A simple mechanism could, for example, delete the contents of eureka-data from within the start script before launching the services.
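Such a cleanup step could be sketched as follows. The function name clean_eureka_data is hypothetical, and the sketch assumes that everything below eureka-data is disposable. On Linux, the files may belong to another user, in which case the call may need to be run with sudo.

```shell
# Sketch: empty the eureka data folder before the stack is started.
# Only meant for setups where stored OCR inputs and results are disposable.
clean_eureka_data() {
  local data_dir="${1:-./eureka-data}"
  # Safety net: never operate on the filesystem root.
  if [ "${data_dir}" = "/" ]; then
    echo "refusing to clean '${data_dir}'" >&2
    return 1
  fi
  mkdir -p "${data_dir}"
  # Delete the contents (including hidden files) but keep the folder itself.
  find "${data_dir}" -mindepth 1 -delete
}

# Usage (before docker compose up): clean_eureka_data ./eureka-data
```

The folder itself is kept so that the volume mount in the docker-compose.yml still finds it.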

Instead of eureka, you can also use an S3-compatible storage such as Amazon S3 or MinIO.

jadice flow OCR worker

Create the file worker-config/application.yml:

worker-config/application.yml
---
stage: dev
publisher:
  eureka:
    endpoint: "${EUREKA_ENDPOINT}"
    username: "${EUREKA_USERNAME}"
    password: "${EUREKA_PASSWORD}"

spring:
  application:
    name: jadice-flow-worker-ocr

opentracing:
  jaeger:
    log-spans: false
    service-name: ${spring.application.name}
    tags:
      stage: ${stage}

management:
  endpoint.health.enabled: true
  endpoint.prometheus.enabled: true
  endpoint.info.enabled: true
  endpoints:
    enabled-by-default: false
    web:
      exposure:
        include: "health,prometheus,info"
  metrics:
    enable:
      all: true
  endpoint:
    health:
      show-details: always

logging:
  level:
    root: INFO
    com.jadice.flow.worker.tessocr: INFO
...

Note the following setting:

  • publisher.eureka - The eureka configuration used to access the image data and to store the results

Configuration summary

Finally, we have the following directory structure:

  • controller-config/application.yml
  • controller-config/jobtemplates.yml
  • controller-config/workers.yml
  • eureka-config/application.yml
  • worker-config/application.yml
  • .env
  • docker-compose.yml
  • start-compose.sh or start-compose.cmd
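Before starting, it can be worth confirming that all files created in this tutorial are in place. The following is a minimal sketch; the function name check_config_layout is hypothetical, and the start script is left out because its name depends on the operating system.

```shell
# Sketch: check that the configuration files created in this tutorial exist
# below the configuration root folder.
check_config_layout() {
  local root="${1:-.}"
  local file status=0
  for file in \
    controller-config/application.yml \
    controller-config/jobtemplates.yml \
    controller-config/workers.yml \
    eureka-config/application.yml \
    worker-config/application.yml \
    .env \
    docker-compose.yml; do
    if [ ! -f "${root}/${file}" ]; then
      echo "missing: ${file}" >&2
      status=1
    fi
  done
  return "${status}"
}

# Usage: check_config_layout . && echo "configuration is complete"
```

The function reports every missing file instead of stopping at the first one, so a single run shows everything that is still to be created.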

Startup

Switch to the configuration root directory and start the jadice flow instance by running your start script (start-compose.sh or start-compose.cmd) in a command shell.
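After the start, the services may need a moment before they accept connections. The following bash sketch probes the host ports published in the docker-compose.yml of this tutorial (8080 for jf-controller, 7081 for the worker, 8085 for eureka). The function names are hypothetical, and the probe relies on bash's built-in /dev/tcp redirection, so it assumes a bash with that feature enabled. An open port only shows that the container is listening, not that the service is fully initialized.

```shell
# Sketch: return success if a TCP connection to host:port can be opened.
port_open() {
  local host="$1" port="$2"
  # The subshell opens the connection and closes it again on exit; errors are silenced.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Report the state of the three host ports published by this tutorial's compose file.
report_published_ports() {
  local entry name port
  for entry in jf-controller:8080 jadice-flow-worker-tessocr:7081 eureka:8085; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open localhost "${port}"; then
      echo "${name}: port ${port} is reachable"
    else
      echo "${name}: port ${port} is not reachable (yet)"
    fi
  done
}

# Usage: report_published_ports
```

If a port stays unreachable, docker compose logs -f shows the output of the containers.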

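To stop the services and remove the containers and the network created by Docker Compose, run docker compose down. The files in the mounted host folders, such as eureka-data, are not deleted.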