Boefjes

This module has several entry points discussed below, but let us first consider the prerequisites and scope. If you already have a running setup and want to learn where each bit of functionality goes, read the following page: :ref:make-your-own-bits.

Prerequisites

To run a development environment you need to have:

A running RabbitMQ service
A running Bytes API service
A ./env containing the environment variables explained below
Everything in the requirements.txt installed

Optionally, you could have an instance of the octopoes api listening on a port that receives the normalized data from the normalizers.

KATalogus

See the openAPI reference at http://localhost:8003/docs. The KATalogus has CRUD endpoints for several objects such as:

Organisation
Plugin
Setting

Organisations

Supported HTTP methods (for CRUD): POST, GET, DELETE. Includes an endpoint that lists all objects. All subsequent objects in the API are namespaced on the organisation_id.

Plugins

Supported HTTP methods (for CRUD): GET, PATCH.

Settings

Supported HTTP methods (for CRUD): POST, GET, DELETE, PUT. Includes an endpoint that lists all objects.

The KATalogus stores environment settings for the different organisations and plugins, accessible through the API. These can be encrypted by setting the BYTES_ENCRYPTION_MIDDLEWARE=NACL_SEALBOX, and the public and private key env vars. More info about the encryption scheme can be found here: https://pynacl.readthedocs.io/en/latest/public/. Currently, the settings are encrypted when stored, and returned decrypted. This could be changed in the future when the boefje-runner/plugin-code can decrypt the secrets itself, although this would be more complicated.

Environment variables

By design, Boefjes do not have access to the host system’s environment variables. If a Boefje requires access to an environment variable (e.g. HTTP_PROXY or USER_AGENT), it should note as such in its schema.json. The system-wide variables can be set as environment variable to the boefjes runner by prefixing it with BOEFJE_. This is to prevent a Boefje from accessing variables it should not have access to, such as secrets. To illustrate: if BOEFJE_HTTP_PROXY=https://proxy:8080 environment variable is configured, the Boefje can access it as HTTP_PROXY. This feature can also be used to set default values for KAT-alogus settings. For example, configuring the BOEFJE_TOP_PORTS environment variable will set the default value for the TOP_PORTS setting (used by the nmap Boefje). This default value can be overridden per organisation by setting any value for TOP_PORTS in the KAT-alogus.

Technical Design

Boefjes will run as containerized workers pulling jobs from a queue in the Scheduler:

        sequenceDiagram
    participant Boefje
    participant Rocky
    participant Scheduler
    participant Normalizer
    participant Bytes
    participant Octopoes
    Boefje->>+Scheduler: Get Boefje Task
    Scheduler-->>Scheduler: boefje_task.status = DISPATCHED
    Boefje->>Scheduler: boefje_task.status = RUNNING
    Boefje-->>Boefje: Run Boefje Task
    Boefje->>Scheduler: boefje_task.status = COMPLETED
    Boefje->>Bytes: Save Raw
    Bytes-->>Scheduler: Raw File Received
    Scheduler->>Scheduler: Push Normalizer Task
    Normalizer->>Scheduler: Get Normalizer Task
    Scheduler-->>Scheduler: normalizer_task.status = DISPATCHED
    Normalizer->>Bytes: Get Raw
    Normalizer->>Scheduler: normalizer_task.status = RUNNING
    Normalizer-->>Normalizer: Run Normalizer Task
    Normalizer->>Scheduler: normalizer_task.status = COMPLETED
    Normalizer->>Octopoes: Add object(s)

The connection between the Scheduler and Bytes is managed asynchronously through RabbitMQ.

Boefje and Normalizer Workers

When we configure a POOL_SIZE of n, we have n + 1 processes: one main process and n workers. The main process pushes to a multiprocessing.Manager.Queue and keeps track of the task that was being handled by the workers. It sets the status to failed when the worker was killed, like when the process runs out of memory and is killed by Docker. (Note: multiprocessing.Queue will not work due to qsize() not being implemented on macOS.) No maximum size is defined on the queue since we want to avoid blocking. Hence, we manually check if the queue does not pile up beyond the number of workers, i.e. n.

Parallel Workers

The setup for the main process and workers:

        graph LR

SchedulerRuntimeManager -- "pop()" --> Scheduler

subgraph Process 0

  multiprocessing.Queue
  SchedulerRuntimeManager -- "put(p_item)" --> multiprocessing.Queue["multiprocessing.Manager.Queue()"]

  Worker-1["Worker 1<br/><i>target = _start_working()"] -- "get()" --> multiprocessing.Queue

  subgraph Process 1
    Worker-1
    Worker-1 -- runs --> Plugin1["Plugin"]
  end

  Worker-2["Worker 2<br/><i>target = _start_working()"] -- "get()" --> multiprocessing.Queue

  subgraph Process 2
      Worker-2
      Worker-2 -- runs --> Plugin2["Plugin"]
  end
end

Sending a SIGKILL to a worker process

A representation of the failure mode when a SIGKILL has been sent to the worker (also see boefjes/app.py):

        sequenceDiagram
  participant SchedulerRuntimeManager
  participant SharedDict
  participant Queue
  participant Worker[pid=1]
  participant Scheduler
  participant Worker[pid=2]
  Worker[pid=1]->>SharedDict: set: SharedDict[1] = p_item.id
  Worker[pid=1]->>Worker[pid=1]: receives SIGKILL
  SchedulerRuntimeManager->>Worker[pid=1]: if not is_alive()
  SchedulerRuntimeManager->>SharedDict: set p_item_id = SharedDict[1]
  SchedulerRuntimeManager->>Scheduler: set p_item_id status to FAILED
  SchedulerRuntimeManager->>Worker[pid=1]: close()
  SchedulerRuntimeManager->>Worker[pid=2]: start()

Here, the SharedDict maps worker process IDs (PIDs) to the task they are handling.

Running as a Docker container

To run a boefje and normalizer worker as a docker container, you can run

docker build . -t boefje
docker run --rm -d --name boefje boefje python -m boefjes boefje
docker run --rm -d --name normalizer boefje python -m boefjes normalizer

Note: the worker needs a running Bytes API and RabbitMQ. The service locations can be specified with environment variables (see the Docker documentation to use a .env file with the --env-file flag, for instance).

Running the worker directly

To start the worker process listening on the job queue, use the python -m boefjes module.

$ python -m boefjes --help
Usage: python -m boefjes [OPTIONS] {boefje|normalizer}

Options:
  --log-level [DEBUG|INFO|WARNING|ERROR]
                                  Log level
  --help                          Show this message and exit.

So to start either a boefje worker or normalizer worker, run:

python -m boefjes boefje
python -m boefjes normalizer

Again, service locations can be specified with environment variables.

Example job

The job file for a DNS scan might look like this:

{
  "id": "b445dc20-c030-4bc3-b779-4ed9069a8ca2",
  "organization": "_dev",
  "boefje": {
    "id": "ssl-scan-job",
    "version": null
  },
  "input_ooi": "Hostname|internet|www.example.nl"
}

If the tool runs smoothly, the data can be accessed using the Bytes API (see Bytes documentation).

Manually running a boefje or normalizer

It is possible to manually run a boefje using

$ ./tools/run_boefje.py ORGANIZATION_CODE BOEFJE_ID INPUT_OOI

This will execute the boefje with debug logging turned on. It will log the raw file id in the output, which can be viewed using show_raw:

$ ./tools/show_raw.py RAW_ID

There is also a --json option to parse the raw file as JSON and pretty print it. The normalizer is run using:

$ ./tools/run_normalizer.py NORMALIZER_ID RAW_ID

Both run_boefje.py and run_normalizer.py support the --pdb option to enter the standard Python Debugger when an exceptions happens or breakpoint is triggered.

If you are using the standard docker compose developer setup, you can use docker compose exec to execute the commands in the container. The boefje and normalizer containers use the same images and settings, so you can use both:

$ docker compose exec boefje ./tools/run_boefje.py ORGANIZATION_CODE BOEFJE_ID INPUT_OOI

For example:

$ docker compose exec boefje ./tools/run_boefje.py myorganization dns-records "Hostname|internet|example.com"
$ docker compose exec boefje ./tools/show_raw.py --json 794986d7-cf39-4a2c-8bdf-17ae58f361ea
$ docker compose exec boefje ./tools/run_normalizer.py kat_dns_normalize 794986d7-cf39-4a2c-8bdf-17ae58f361ea

Boefje and normalizer structure

Each boefje and normalizer module is placed in boefjes/<module>. A module’s main script is usually called main.py, and a normalizer is usually called normalize.py, but it also may contain one or more normalizers. A definition file with metadata about the boefje is called boefje.py. Here a Boefje object that wraps this metadata is defined, as well as Normalizer objects that can parse this Boefje’s output. Each module may also have its own requirements.txt file that lists dependencies not included in the base requirements. Furthermore, you can add static data such as a cover and a description (markdown) to show up in Rocky’s KATalogus.

Example

$ tree boefjes/plugins/kat_dns
├── boefje.json
├── cover.jpg
├── description.md
├── __init__.py
├── main.py
├── normalize.py
├── normalizer.json
└── schema.json

Tests

To run the unit test suite, run:

$ python -m pytest

For the KATalogus integration tests, run:

$ make itest

To lint the code using pre-commit, run:

$ pre-commit run --all-files