Bytes
Bytes is a service that provides an API for reading and writing the metadata of jobs and their outputs (raw files). It can also encrypt the raw data and hash it to prove that the data was seen before a certain point in time (see below).
Installation
There are two ways to set up the API.
With Docker
Bytes can be fired up from the root directory of KAT using docker-compose (check out the README over there!).
To run Bytes as a standalone container, spin up a PostgreSQL database (e.g. using Docker), create the database bytes, and run
$ docker build . -t bytes
# Without an env-file
$ export BYTES_PASSWORD=$(openssl rand -hex 20) \
&& export BYTES_SECRET=$(openssl rand -hex 20) \
&& export BYTES_DB_URI=postgresql://USER:PWD@bytes-db:5432/bytes # change accordingly!
$ docker run --rm -p 8002:8002 -e BYTES_USERNAME=bytes -e BYTES_PASSWORD -e BYTES_SECRET -e BYTES_DB_URI bytes
# With an env-file
$ docker run --rm -p 8002:8000 --env-file=/path/to/env bytes # change accordingly!
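For the env-file variant, a minimal file could look like this (the values are illustrative placeholders, mirroring the variables above):
BYTES_USERNAME=bytes
BYTES_PASSWORD=<generate, e.g. with openssl rand -hex 20>
BYTES_SECRET=<generate, e.g. with openssl rand -hex 20>
BYTES_DB_URI=postgresql://USER:PWD@bytes-db:5432/bytes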
Without Docker
To create and start a Python virtual environment, run
$ python -m venv $PWD/.venv
$ source .venv/bin/activate
To install the dependencies, assuming you are in the virtual environment, run
$ pip install -r requirements-dev.txt
Bytes depends on a PostgreSQL database that is configurable with the BYTES_DB_URI environment variable. See above for a minimal set of environment variables needed to start Bytes.
To start the API, run
$ uvicorn bytes.api:app --host 127.0.0.1 --port 8002 --reload --reload-dir /app/bytes/bytes
See http://localhost:8002/docs for the OpenAPI documentation.
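As a quick smoke test, the following Python sketch logs in and obtains an access token. The /token endpoint and its form fields are assumptions based on the OpenAPI docs above; verify them there for your version.

import requests

# Hypothetical local setup: use the BYTES_USERNAME/BYTES_PASSWORD configured earlier.
response = requests.post(
    "http://localhost:8002/token",
    data={"username": "bytes", "password": "<your BYTES_PASSWORD>"},
)
response.raise_for_status()

# Subsequent API calls authenticate with this bearer token.
headers = {"Authorization": f"bearer {response.json()['access_token']}"}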
Hashing and Encryption
Every raw file is hashed together with the ended_at timestamp of its boefje_meta,
which functions as a 'proof' of it being uploaded at that time.
These proofs can be uploaded to an external (third-party) service, such that we can later verify that this data was saved in the past.
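Conceptually, such a proof can be sketched as follows (a simplified illustration, not Bytes' exact implementation):

import hashlib
from datetime import datetime, timezone

# Hashing the raw data together with the job's ended_at timestamp ties the
# hash to that moment; timestamping this hash at a third party then proves
# the data existed when it was uploaded.
raw = b"raw file contents"
ended_at = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()
proof = hashlib.sha512(raw + ended_at.encode()).hexdigest()
print(proof)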
Current implementations are:
- BYTES_EXT_HASH_REPOSITORY="IN_MEMORY" (just a stub)
- BYTES_EXT_HASH_REPOSITORY="PASTEBIN" (needs a Pastebin API development key)
- BYTES_EXT_HASH_REPOSITORY="RFC3161"
For the RFC3161 implementation, see https://www.ietf.org/rfc/rfc3161.txt and https://github.com/trbs/rfc3161ng as a reference. To use this implementation, set your environment to:
BYTES_EXT_HASH_REPOSITORY=RFC3161
BYTES_RFC3161_PROVIDER="https://freetsa.org/tsr" # example
BYTES_RFC3161_CERT_FILE="bytes/timestamping/certificates/freetsa.crt" # example
Adding a new implementation means implementing the bytes.repositories.hash_repository::HashRepository interface.
Bind your new implementation in bytes.timestamping.provider::create_hash_repository.
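A skeleton for such an implementation could look like this. The method names and signatures below are assumptions for illustration; check the actual HashRepository interface in the source before implementing.

from bytes.repositories.hash_repository import HashRepository


class InHouseHashRepository(HashRepository):  # hypothetical example
    def store(self, secure_hash):  # signature assumed, verify against the interface
        """Upload the hash to an external, write-once service."""
        ...

    def verify(self, secure_hash):  # signature assumed, verify against the interface
        """Check that the hash was stored there at an earlier point in time."""
        ...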
The secure-hashing algorithm and the external hash repository can thus be configured with environment variables, for example:
BYTES_HASHING_ALGORITHM="SHA512"
BYTES_EXT_HASH_REPOSITORY="IN_MEMORY"
BYTES_PASTEBIN_API_DEV_KEY=""
Files in Bytes can be saved to disk encrypted; the implementation can be set using an env var, BYTES_ENCRYPTION_MIDDLEWARE. The options are:
- "IDENTITY" (no encryption)
- "NACL_SEALBOX"
The "NACL_SEALBOX" option requires the BYTES_PRIVATE_KEY_B64 and BYTES_PUBLIC_KEY_B64 env vars.
Observability
Bytes exposes a /metrics endpoint for basic application-level observability,
such as the number of organizations and the number of raw files per organization.
Another important component to monitor is the disk usage of Bytes.
It is recommended to install the Prometheus node exporter to keep track of this.
Design
We now include two levels of design, according to the C4 model.
Design: C2 Container level
The overall view of the code is as follows.
graph
User((User))
Rocky["Rocky<br/><i>Django App</i>"]
Bytes{"Bytes<br/><i>FastAPI App"}
RabbitMQ[["RabbitMQ<br/><i>Message Broker"]]
Scheduler["Scheduler<br/><i>Software System"]
Boefjes["Boefjes<br/><i>Python App"]
Boefjes -- GET/POST Raw/Meta --> Bytes
User -- Interacts with --> Rocky
Rocky -- GET/POST Raw/Meta --> Bytes
Bytes -- "publish(RawFileReceived)" --> RabbitMQ
Scheduler --"subscribe(RawFileReceived)"--> RabbitMQ
Scheduler --"GET BoefjeMeta"--> Bytes
Design: C3 Component level
The overall view of the code is as follows.
graph LR
User -- BoefjeMeta --> APIR1
User -- NormalizerMeta --> APIR2
User -- RawFile --> APIR3
User[User]
APIR1 -- save BoefjeMeta --> MR
APIR2 -- save NormalizerMeta --> MR
APIR3 -- save RawFile --> MR
subgraph API["Bytes API"]
APIR1[API Route]
APIR2[API Route]
APIR3[API Route]
end
subgraph Bytes["Bytes Domain"]
APIR3 -- "publish(RawFileReceived)" --> EM[EventManager]
MR[Meta Repository] -- Raw --> H[Hasher] -- Hash --> MR[Meta Repository]
MR[Meta Repository] -- save Hash --> HR[Hash Repository]
MR[Meta Repository] -- save RawFile --> R[Raw Repository]
R[Raw Repository] -- RawFile --> F[FileMiddleware]
end
F[FileMiddleware] -- Encrypted Data --> Disk[[Disk]] -- Encrypted Data --> F[FileMiddleware]
HR[Hash Repository] -- Hash --> T[[Third Party]]
MR[Meta Repository] -- BoefjeMeta/NormalizerMeta --> RDB[(Psql)]
EM[EventManager] -- "{'event_id': 123}" --> RabbitMQ[[RabbitMQ]]
This diagram roughly covers the C4 level as well, as this is a small service that can be regarded as one component.
Development
The Makefile provides useful targets to use during development. To see the options run
$ make help
Code style and tests
All the code style and linting checks are done by running
$ make check
The unit and integration test targets are utest and itest, respectively.
To run all tests, run
$ make test
To make sure all GitHub Actions (checks and tests) pass, run
$ make done
Ideally, you run this before each commit. Passing all the checks and tests in this target should ensure the GitHub Actions pass.
Migrations
To make a new migration file and run the migration, run
$ make migrations m='Some migration message'
$ make migrate
Export SQL migrations
To export raw SQL from the SQLAlchemy migration files, run the following target (for the diff between 0003 and 0004):
$ make sql rev1=0003 rev2=0004 > sql_migrations/0004_change_x_to_y_add_column_z.sql
Production
Performance tuning
Bytes caches some metrics for performance, but by default these queries are not cached.
It is recommended to tune the BYTES_METRICS_TTL_SECONDS variable based on the number of calls to the /metrics endpoint.
As a guideline, add at least 10 seconds to the cache TTL for every million raw files in the database; with three million raw files, for example, a TTL of at least 30 seconds is advised.