Bytes

Bytes is a service that provides an API for reading and writing metadata of jobs and job outputs (raw files). It can also encrypt the raw data, and hash it to proof that the data was seen before a point in time (see below).

Installation

There are two ways to setup the API.

With Docker

Bytes can be fired up from the root directory of KAT using docker-compose (check out the README over there!).

To run Bytes as a standalone container, spin up a Postgresql database (e.g. using Docker), create the database bytes and run

$ docker build . -t bytes

# Without an env-file
$ export BYTES_PASSWORD=$(openssl rand -hex 20) \
    && export BYTES_SECRET=$(openssl rand -hex 20) \
    && export BYTES_DB_URI=postgresql://USER:PWD@bytes-db:5432/bytes  # change accordingly!
$ docker run --rm -p 8002:8002 -e BYTES_USERNAME=bytes -e BYTES_PASSWORD -e BYTES_SECRET -e BYTES_DB_URI bytes


# With an env-file
$ docker run --rm -p 8002:8000 --env-file=/path/to/env bytes  # change accordingly!

Without Docker

To create and start a Python virtual environment, run

$ python -m venv $PWD/.venv
$ source .venv/bin/activate

To install the dependencies, assuming you are in the virtual environment, run

$ pip install -r requirements-dev.txt

Bytes depends on a Postgresql database that is configurable by the BYTES_DB_URI environment variable. See above for a minimal set of environment variables to start Bytes and

To start the API run

$ uvicorn bytes.api:app --host 127.0.0.1 --port 8002 --reload --reload-dir /app/bytes/bytes

See http://localhost:8002/docs for the OpenAPI documentation.

Hashing and Encryption

Every raw file is hashed with the current ended_at of the boefje_meta, which functions as a ‘proof’ of it being uploaded at that time. These proofs can be uploaded externally (a 3rd party) such that we can verify that this data was saved in the past.

Current implementations are

BYTES_EXT_HASH_REPOSITORY="IN_MEMORY" (just a stub)
BYTES_EXT_HASH_REPOSITORY="PASTEBIN" (Needs pastebin API development key)
BYTES_EXT_HASH_REPOSITORY="RFC3161"

For the RFC3161 implementation, see https://www.ietf.org/rfc/rfc3161.txt and https://github.com/trbs/rfc3161ng as a reference. To use this implementation, set your environment to

BYTES_EXT_HASH_REPOSITORY=RFC3161
BYTES_RFC3161_PROVIDER="https://freetsa.org/tsr" (example)
BYTES_RFC3161_CERT_FILE="bytes/timestamping/certificates/freetsa.crt" (example)

Adding a new implementation means implementing the bytes.repositories.hash_repository::HashRepository interface. Bind your new implementation in bytes.timestamping.provider::create_hash_repository.

The secure-hashing-algorithm can be specified with an env var: BYTES_HASHING_ALGORITHM="SHA512".

BYTES_HASHING_ALGORITHM="SHA512"
BYTES_EXT_HASH_REPOSITORY="IN_MEMORY"
BYTES_PASTEBIN_API_DEV_KEY=""

Files in bytes can be saved encrypted to disk, the implementation can be set using an env-var, BYTES_ENCRYPTION_MIDDLEWARE. The options are:

"IDENTITY"
"NACL_SEALBOX"

The "NACL_SEALBOX" option requires the BYTES_PRIVATE_KEY_B64 and BYTES_PUBLIC_KEY_B64 env vars.

Observability

Bytes exposes a /metrics endpoint for basic application level observability, such as the amount of organizations and the amount of raw files per organization. Another important component to monitor is the disk usage of Bytes. It is recommended to install node exporter to keep track of this.

Design

We now include two levels of design, according to the C4 model.

Design: C2 Container level

The overall view of the code is as follows.

graph User((User)) Rocky["Rocky Django App"] Bytes{"Bytes FastAPI App"} RabbitMQ[["RabbitMQ Message Broker"]] Scheduler["Scheduler Software System"] Boefjes["Boefjes Python App"] Boefjes -- GET/POST Raw/Meta --> Bytes User -- Interacts with --> Rocky Rocky -- GET/POST Raw/Meta --> Bytes Bytes -- "publish(RawFileReceived)" --> RabbitMQ Scheduler --"subscribe(RawFileReceived)"--> RabbitMQ Scheduler --"GET BoefjeMeta"--> Bytes

Design: C3 Component level

The overall view of the code is as follows.

graph LR User -- BoefjeMeta --> APIR1 User -- NormalizerMeta --> APIR2 User -- RawFile --> APIR3 User[User] APIR1 -- save BoefjeMeta --> MR APIR2 -- save NormalizerMeta --> MR APIR3 -- save RawFile --> MR subgraph API["Bytes API"] APIR1[API Route] APIR2[API Route] APIR3[API Route] end subgraph Bytes["Bytes Domain"] APIR3 -- "publish(RawFileReceived)" --> EM[EventManager] MR[Meta Repository] -- Raw --> H[Hasher] -- Hash --> MR[Meta Repository] MR[Meta Repository] -- save Hash --> HR[Hash Repository] MR[Meta Repository] -- save RawFile --> R[Raw Repository] R[Raw Repository] -- RawFile --> F[FileMiddleware] end F[FileMiddleware] -- Encrypted Data --> Disk[[Disk]] -- Encrypted Data --> F[FileMiddleware] HR[Hash Repository] -- Hash --> T[[Third Party]] MR[Meta Repository] -- BoefjeMeta/NormalizerMeta --> RDB[(Psql)] EM[EventManager] -- "{'event_id': 123}" --> RabbitMQ[[RabbitMQ]]

This diagram roughly covers the C4 level as well, as this is a small service that can be regarded as one component.

Development

The Makefile provides useful targets to use during development. To see the options run

$ make help

Code style and tests

All the code style and linting checks are done by running

$ make check

The unit and integration tests targets are utest and itest respectively. To run all test, run

$ make test

To make sure all github actions (checks and tests) pass, run

$ make done

Ideally, you run this before each commit. Passing all the checks and tests in this target should ensure the github actions pass.

Migrations

To make a new migration file and run the migration, run

$ make migrations m='Some migration message'
$ make migrate

Export SQL migrations

To export raw SQL from the SQLAlchemy migration files, run the following target (for the diff between 0003 and 0004):

$ make sql rev1=0003 rev2=0004 > sql_migrations/0004_change_x_to_y_add_column_z.sql

Production

Performance tuning

Bytes caches some metrics for performance, but the default is not to cache these queries. It is recommended to tune the BYTES_METRICS_TTL_SECONDS variable to on the amount of calls to the /metrics endpoint. As a guideline, add at least 10 seconds to the cache for every million of raw files in the database.