Weekly maintain live server. / Splitcells™ Network

Issue number: #8

Task Description

The static server is hosted at Hetzner as well. It is the highly available primary access point to the project, as its tech stack is very simple.

The live server is hosted at Hetzner. The server settings of Hetzner can be accessed here. The network traffic is included in the fixed package price of Hetzner and no additional fees are charged, as long as no additional network expansions are bought for the package.

Service

The server is publicly available at http://live.splitcells.net
Update server.
- Upgrade major version of OS when available.
- Update deployed software.
Improve deployment and its processes.
Test security.
Test legalities and privacy policy.
- Check via browser whether really no cookies are set.
Check via browser, if there are client side errors.
Check htop.
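
The manual cookie check in the browser could be complemented by a small scripted check. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and not part of the project. It only detects cookies set via HTTP response headers, so cookies set by client-side JavaScript would still need the browser check.

```python
import urllib.request


def set_cookie_headers(headers):
    """Return the values of all Set-Cookie headers from (name, value) pairs."""
    # Header names are case-insensitive, so compare in lower case.
    return [value for name, value in headers if name.lower() == "set-cookie"]


def fetch_set_cookie_headers(url, timeout=10):
    """Fetch the URL and return the Set-Cookie header values of the response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # response.headers.items() lists every header, including repeated ones.
        return set_cookie_headers(response.headers.items())
```

Calling `fetch_set_cookie_headers("https://live.splitcells.net")` would be expected to return an empty list, if the server sets no cookies via headers for that page.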

Open Tasks

[ ] Upgrade Debian.
[ ] Create dedicated logging services.
- [x] Move from Dockerfile to Podman compose. -> Create dedicated docker compose for additional optional infrastructure.
- [x] Setup metrics server: https://prometheus.io/docs/prometheus/latest/installation/
- [x] Start Podman compose on server startup via a systemd user service.
- [ ] Get access to server via SSH port-forwarding scripts.
- [ ] Setup visualisation server based on Grafana:
- [ ] Adapt logging to Prometheus via vendor-agnostic OpenTelemetry: use Prometheus server, if it is found via config or convention and otherwise store logs to file as it already is.
  - [ ] Create own telemetry log API.
  - [ ] Forward own API calls to OpenTelemetry.
  - [ ] Forward Telemetry to Log library, if Prometheus server is not reachable.
  - [ ] Forward Telemetry to Log library, if logging via OpenTelemetry to the Prometheus server stops working at any time (e.g. caused by a connection issue).
- [ ] Log JVM metrics: https://prometheus.github.io/client_java/instrumentation/jvm/
  - [ ] CPU usage
  - [ ] Memory usage
  - [ ] JVM Garbage Collector Metrics
  - [ ] JVM Thread Metrics
- [x] Setup Java profiling:
  - [x] https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/java/
  - [x] https://grafana.com/docs/pyroscope/latest/configure-client/language-sdks/java/
  - [x] Make Pyroscope agent optional in Java based execution via worker execute. -> Pyroscope is done via Java code only instead, as it is more flexible and minimizes the worker execute API.
  - [o] Determine Pyroscope agent name via POM automatically, instead of hard coding the jar's name. -> The agent is not used and therefore the name does not have to be determined.
  - [x] https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/ebpf/setup-docker/
  - [x] Clean up Dem#startPyroscope. Is Dem#startPyroscope needed at all? -> Dem#startPyroscope was moved to PyroscopeService.
  - [x] Clean up setup.monitoring.sh.
  - [x] Configure host.docker.internal correctly.
  - [x] Ensure that Prometheus data is persisted across restarts.
  - [x] Use different port than 9090 as it conflicts with cockpit.
  - [x] Re-enable cockpit service.
  - [x] On the live server, Grafana cannot reach Prometheus, but can reach the Pyroscope service.
  - [x] Create an easy to use Grafana connection command.
  - [o] Redeploy Monitoring services semi-automatically. -> This is not needed for now.
  - [x] Fix monitoring services versions and organize its updates. -> Using latest is good enough for now. When the first problem appears, we will fix the version, as implementing an update workflow right now has limited advantages.
  - [x] Setup performance engineering service task.
- [ ] Consider creating a VPN for accessing the server instead of port-forwarding.
- [ ] Note, that this is done as this is generic functionality. It also allows one to do complex analysis and monitoring. Furthermore, the telemetry services are completely optional and the server will work, if these are down, after a restart, without a restart and without any additional config. In general, additional services like a database are ok, as long as these are optional.
- [ ] Send vert.x log to Prometheus as well.
- [ ] Note that some logs will still be saved locally in the future for error databases, where a Prometheus integration does not make sense.
- [ ] Check for better log viewers in bash as an alternative to a full-blown prometheus, as this would simplify the deployment. -> Java Profiling is important enough in order to set up this stack. Note this.
- [ ] Consider https://last9.io/blog/prometheus-with-docker-compose/ for advanced functionality.
[ ] Host CPU/Memory Utilization page does not work. -> Delete these pages, when Prometheus and Grafana are set up.
- [ ] https://live.splitcells.net/net/splitcells/host/resource/cpu/utilization.csv.html
- [ ] https://live.splitcells.net/net/splitcells/host/resource/memory/utilization.csv.html
[ ] Create error reporter page, that lists all errors without duplicates and not the complete log.
- [ ] Status of UI tests and tester
[ ] Provide debug port for Java service over SSH based port forwarding.
[ ] Save user credentials as salted hashes.
[ ] If external ACME server is not available, but the certificate is still valid, that service should be able to start successfully and not crash at start.
[ ] Declare a data protection officer.
[ ] Make privacy policy of live and static server the same.
[ ] Reset the git repos, in order to prevent an unexpected state.
[ ] Log UI test runtime performance.
[ ] Synchronize Playwright in Container created by network.execute and in Network Bom, in order to avoid some Playwright integration issues.
[ ] Sometimes submitting an optimization does not work.
[ ] This was caused by a bug in the LookupManager, when the persisted lookup got enabled.
- [ ] Add a test for submitting optimization to the daily Codeberg CI.
- [ ] Create test for lookup manager.
[ ] Create an admin page, where all distinct errors can be viewed.
- [ ] Add command to delete 1 error from the view.
[ ] Make logs smaller.
[ ] Reset .m2 folder, in order to prevent an unexpected state.
[ ] Create test workers like htmlClient, but without a browser, because currently the browser tests seem to be kind of unreliable. The reason is that something goes wrong after a while in the Playwright integration. There are always new problems and tests
- [ ] Document the goal of non GUI test workers.
- [ ] Consider HTML/Javascript client written purely in Java as well, in order to avoid the problems with Playwright.
- [ ] For this the ProjectsRenderer needs to be a Dem Option.
  - [x] Create ProjectsRendererOption.
  - [ ] Initialize at least, when the Live Distro or its Dev is run.
[ ] Fix memory leak in main Java service, that gets it killed by the OS after 2 days. See Main service killed by OOM killer after 2 days.
- [o] Restart the application every Sunday once at 1 hour after midnight. -> It worked for some days. It seems to be better to let the program run as long as possible, in order to find some issues.
- [ ] Every program exit should cause a heap dump, for better maintenance.
  - [x] Core dumps are created by default on JVM crashes. These should be enough. Set -XX:ErrorFile= for the JVM, so core dumps are persisted and can be analyzed.
  - [ ] Delete all core dumps older than 7 days, as these could contain private information. Do this in the worker.execute command via a generated execution script, when the --class-for-execution flag is used, because worker.execute determines that the core dumps are created and determines their location.
    - [ ] Document this in the arg doc of the network worker and note, that this is done in order to comply with EU's GDPR.
    - [ ] Create script template.
    - [ ] Generate script.
    - [ ] Launch script in container instead of using Java entrypoint.
[ ] Install a TUI for Docker in order to debug Forgejo runner, as it sometimes does seem to start a queued workflow a bit late.
[ ] Restart main service, when the UI testers are not working anymore. Currently, an error for the UI testers not working could not be found.
[ ] Execute runtime profiling for long-running instances and store these, in order to improve day to day performance via Grafana and Pyroscope.
[ ] Automatically and continuously check, if the SSL certificate for HTTPS is still valid and replace it automatically.
[ ] Execute the complete test suite on live server periodically and commit result data to network log.
[ ] Automatically restart server after update installation.
[ ] apt upgrade packages are seemingly not installed by unattended-upgrades. This is required for Linux kernel updates.
[ ] Make default file storage locations more sane regarding Linux home folder standard.
- [ ] Migrate files on live server accordingly.
[ ] Manage upgrading major OS versions.
[ ] The corresponding systemd service should only store logs up to 7 days.
[ ] Make private setup script public, in order to have a basis for default setup script for a server.
[ ] Run Forgejo Runner via Podman in order to not require root rights for Forgejo Runner: https://code.forgejo.org/forgejo/runner/src/branch/main/scripts/systemd.md
[ ] Do not log already logged message, in order to simplify logs on live server.
[ ] Do not output logs to standard output by default, in order to have minimal OS logs.
[ ] Consider creating double book-keeping for config files, in order to detect ones that are not used. Abort the software, when such an unused file is found.
[ ] Create backup of files.
[ ] Do disaster recovery tests.
[ ] Update certificates for ACME automatically without an explicit restart, in order to avoid these expiring during production.
[ ] Create dedicated error log or error search query.
[ ] Consider migrating from Podman to k8s.
- [ ] In order to be able to run many other things with unified infrastructure like the Codeberg runner, that currently kind of needs docker (via https://code.forgejo.org/forgejo/runner/src/branch/main/examples/kubernetes).
- [ ] Consider using https://k9scli.io/ as cli TUI.
[ ] Block outgoing connections.
[ ] Make setup script for live server open source as well.
[ ] Create 404 page for web server.
[ ] Consider automatically sending a mail, when an error happens.
[ ] Consider Nix for package management: Matthew Croughan - Use flake.nix, not Dockerfile - MCH2022
[ ] Speed up deployment via parallel module builds with mvnd.
[ ] Log public server availability via dedicated hardware.

Done Tasks

[x] Require authorization for net/splitcells/website/layout/build.html
- [x] Implement a new ProjectsRendererExtension for this. -> BuildLayoutExtension is the new implementation.
- [x] Register extension to ProjectsRenderer. -> Something is not working there. -> This was just a misunderstanding.
- [x] Delete the current implementation inside ProjectsRendererI.
[x] Speed up user.bin.configure, in order to speed up redeployment.
- [x] command.repositories.install is causing the problem.
  - [x] Provide argument so that command.managed.install.py does not install one file for each call, but a given folder completely instead.
[x] Check why so many program state folders are created. See Program States note.
- [x] Check one step at a time of the deploy.remote script.
  - [x] Avoid ~/.local/state/net.splitcells.martins.avots.distro.livedistro.
    - [x] This is created by the first bin/worker.execute.py of deploy.remote and only has an empty target folder of the public net.splitcells.network.
    - [x] The problem is probably the bin/worker.execute.py inside the generated console scripts created by bin/worker.execute.py. -> It was that and dynamic folder creation for local execution.
  - [x] What creates ~/.local/state/net.splitcells.martins.avots.distro.livedistro? -> This was fixed with the previous task.
[x] Check whether Hetzner's network cost per month is limited or not and document this. Create a Hetzner document, for administration, notes and guidelines. -> Traffic is included in the fixed package price and no additional fees are charged, as long as no additional network expansions are bought for the package. This is noted in this issue in the task description.
[o] Playwright based tests sometimes do nothing. -> Playwright tests work now.
- [x] Avoid XSL errors in systemd logs.
- [ ] Maybe there is also a problem, when the submitted problem is optimized, but not fully solved. -> No, Playwright is not working.
  - [x] Minimize Systemd logs, so that Playwright errors can be found there. -> No Playwright errors are output to standard out and standard error anymore.
  - [ ] Run program in remote debug mode.
    - [ ] Add new JVM parameter -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:8000.
    - [ ] Forward port of container to host machine, but don't make the port publicly available.
    - [ ] Create command to forward the live server debug port to local machine over SSH.
  - [ ] For each log message, also log its thread name.
  - [ ] Try restarting Playwright instance daily.
  - [x] Maybe an error in the test causes problems for Playwright. -> According to the logs, errors are recovered.
  - [x] Maybe the problem is that optimization requests are being queued, but not processed yet. Thereby, the queue grows until something breaks. -> This does not seem to be the case.
  - [x] Is Playwright present in container in the correct version? -> Yes
  - [x] Create only one Playwright server, that provides one browser per HtmlClient in Java.
- [ ] If nothing works, use HTMLUnit instead.
[x] Detect any deployment errors.
- [x] Maven Build
- [x] Shell Project Setup
- [x] worker.execute is affected as well. -> This is already the case.
[x] Use only fast-forward git pulls for relevant workflows. -> This is already the case.
[x] Test if pages with authorization work without authentication. This should not be the case. -> Authorization does not work without authentication and therefore it is working correctly.
[x] Because of Playwright TimeoutError: According to HtmlClientSharer, closing the Firefox browser with Playwright does not always work. So, starting and closing a real browser for each test run may not work with Playwright, as this can cause a resource leak. If this is the case, consider changing HtmlClientSharer so that there is a pool of real browser instances, instead of just one (HTMLClients#ROOT_CLIENT). In other words, use one browser per CPU core, instead of one browser tab of a singleton browser per thread. Use this HtmlClientSharer, instead of starting and closing a browser for each test run.
- [x] Implement this.
- [x] Test podman run --security-opt seccomp=unconfined as a fix for NodeJS start. -> --security-opt seccomp=unconfined does not fix the issue, because closing a Playwright instance does not terminate all of its processes. So, in the end processes are spawned until there are too many for the OS.
- [x] Note the reason, why a browser is only accessed by one thread at a time: https://github.com/microsoft/playwright-java/issues/1184
- [x] Only one browser at a time should be launched, as this also caused threading issues in the past.
- [x] Clean up the LiveDistro TODOs, if the UI tester works by now.
- [x] See chapter process/resource limits reached.
  - [x] Try --security-opt seccomp=unconfined. -> This worked.
  - [x] Document why --security-opt seccomp=unconfined is used.
  - [x] --pids-limit=-1 seems to be the actual solution. Remove --security-opt seccomp=unconfined and deploy this to live server and check results.
    - [x] Deploy --security-opt seccomp=unconfined removal. -> The deployment is broken. -> The deployment is fixed.
    - [x] Check results.
- [o] Why are tabs or their context etc. being closed? Target page, context or browser has been closed\n name='TargetClosedError\n stack='TargetClosedError: Target page, context or browser has been closed\n -> This does not need to be noted as saving resources is the default use case for closing a resource.
- [x] Clean up HtmlClientsShare and HtmlClientsSharer.
- [x] Note the reason for the error message [62.986s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 4k, detached..
- [x] Clean up HtmlClients.
[x] Because of Playwright TimeoutError: Try UI tester, that starts a browser for each test instance and then destroys it, but does not do actions over any browsers in parallel. This is like the first UI HTML client draft, but with an exclusive lock for any action on any browser. -> Closing a Playwright instance sometimes does not work, which causes a resource leak. The leak leads to the crash of the container, as there are too many processes. Therefore, this does not work. Furthermore, maybe all errors were caused by the fact, that the Geal editor test was not correctly written.
- [x] Analyse how the HTML client works now. -> It has a fixed pool of browsers, where only one thread can act on any one of them at a time. The HTML client has to kill its browser after the usage is done.
- [x] The Geal editor has to replace the code editor fully first.
- [x] Correct the Playwright locator usage and keep in mind, that accessing a non-existing thing via locators causes a timeout exception.
  - [x] One has to check the thing's presence first.
  - [x] Handle timeout exceptions and add a better message to these, so its meaning is easier to understand.
[x] Check if the error Failed to create driver at com.microsoft.playwright.impl.driver.Driver.createAndInstall(Driver.java:105) reappears. If that is the case, the reason for it must be found. A theory is, that the Playwright initial Java base setup does not work. For this the Linux journal log can be checked.
- [x] Playwright cannot download the browser sometimes, because of a network error.
  - [x] Cache Playwright's browser downloads, by caching ~/.cache/ms-playwright/ via worker.execute.
[x] Fix Failed to create driver at com.microsoft.playwright.impl.driver.Driver.createAndInstall(Driver.java:105). -> Updating and redeploying the software fixed the issue.
[x] Restart the server daily automatically.
- [x] Move automatic update to 3:00 to 3:45.
- [x] Setup daily restart configuration for at 4:00.
- [x] Check if new configuration worked.
[x] The deployed systemd service shuts down after a while. Maybe this is caused by the oneshot type or maybe this is caused by the daily server restart? -> This is caused by the fact, that the systemd service config has not configured an automated service start. -> The systemd user service was not enabled.
[x] Create and use generic worker.execute command, in order to make things portable regarding the infrastructure.
- [x] Deploy server software as systemd user service.
  - [x] Create user service.
  - [x] Make user service reachable via network.
  - [x] Start user service on server start automatically.
  - [x] Build image during build command and execute image during execute command with net.splitcells.network/bin/worker.execute, instead of net.splitcells.network.worker/bin/worker.execute. Currently, the build command builds the Java part and the execute command builds the container image.
    - [x] Merge worker.execute.* commands into one worker.execute command.
      - [x] worker.execute is command with file storage.
        - [x] Use more descriptive names for $1 and $2.
      - [x] worker.program is command without file storage.
      - [x] worker.service is command to execute command in detached mode.
    - [o] Consider worker.bootstrap.remote.at.
    - [x] Add parameter to worker.execute in order to build a project at the current folder in a standardized way.
      - [o] Consolidate worker.repo.build.
    - [o] Create flag for worker.execute command, in order to execute program based on files created via worker.build.
    - [x] Create flag in order to execute program as a persistent service.
- [x] Delete obsolete net.splitcells.network.worker repo.
- [x] Use this command for existing test deployment commands as well. This tests whether this new command is portable or not.
  - [o] deploy.build.at -> This command is deleted.
  - [o] deploy.test.extensively.at -> This command is deleted.
- [x] Build everything via mvn clean install at net.splitcells.network.hub.
- [x] Simplify deploy.remote.
[x] Correct download logs command.
[x] Support flat folder on Java side.
[x] Set program name to net.splitcells.martins.avots.distro.
[x] Execute more tests at once, in order to have a better load test on production.
[x] Create UI tester for text editor as well, in order to test both.
[o] Browser tests are not always working. Log message: "Target page, context or browser has been closed" -> The warning log "Closing HTML clients is implemented, but is not actually expected to be used in production." with its stack-trace was implemented and deployed in order to find the reason for this error. For now this task is closed, as this only appears sometimes. When the warning or log message is found again it will be attempted again.
[x] Fix JS errors in Gel's UI editor. -> JQuery was not unpacked by Maven, as there was a silent dependency resolution error.
[x] Avoid deadlock in HTML client factory.
[x] Playwright is not working anymore.
- [x] Install Playwright dependencies via Maven, so that the dependencies are more consistent. See Playwright Notes.
- [x] Try using only one browser playwright instance at a time.
- [x] Use public domain for Playwright based tests, so that the certificate can be accepted by the browser.
- [o] Try fixing Playwright's potential race condition, while still maintaining multiple Playwright instances. -> Using one Playwright instance, that is shared across multiple testers makes this work and is even more performant.
- [x] Use only one browser instance and one browser tab for each tester instead, in order to avoid process leak in Playwright. Playwright does not seem to close all processes/threads after the browser and Playwright is closed in Java, as many Socket Process and Utility Process processes with dedicated PIDs were found on live server.
- [ ] There seems to be a race condition regarding the close method. -> This may be caused by not closing the tabs of the HTML clients after the test. Currently, it is not the close method, but the newPage method of Playwright instead. Recycle browser tabs like the browser itself.
  - [x] Playwright is not thread safe. Use one dedicated instance per thread.
    - [x] Simplify Playwright factory implementation by using a thread safe queue instead.
    - [x] Provide thread safe queue implementation.
  - [x] A solution is implemented and deployed, that provides one browser per live tester thread. Check later, if this solution is working for more than 24 hours.
[x] Avoid sharing document files in worker.execute by default.
[x] Pull source code from Codeberg instead of GitHub.
[x] Avoid logging to stdout and stderr, in order to have a clean systemd log.
[x] Correct deployment via worker on live server. Currently, it is completely broken as the file paths have changed.
[x] Automatic upgrade does not always work. There is sometimes a difference between unattended-upgrades (with apt-daily and apt-daily-upgrade) and apt update && apt upgrade --yes.
- [o] Create own automatic restart service, if this gets too complicated. It already cost too many hours. Also keep in mind that unattended-upgrades config is very complex and therefore already an argument in itself to replace it with simple custom command. Especially, when the debug log is so bad, because one does not see the concrete APT/dpkg actions in the log. If this is done, document this reasoning. -> The unattended-upgrades usage is fixed, so this is not needed for now.
- [x] Check if unattended-upgrades is working with some fixes.
  - [ ] If this works, persist fixes in private git repo.
  - [x] Try solving the problem via Origins-Pattern of "origin=*"; and "o=*";.
  - [x] Expand /etc/apt/apt.conf.d/20auto-upgrades with APT::Periodic::Enable "1";.
  - [x] Expand /etc/apt/apt.conf.d/50unattended-upgrades with Unattended-Upgrade::SyslogEnable "true";Unattended-Upgrade::SyslogFacility "daemon";Unattended-Upgrade::Verbose "true";.
  - [x] Expand /etc/apt/apt.conf.d/50unattended-upgrades with Unattended-Upgrade::Debug "false";.
[x] Make unattended-upgrades work.
[x] Do not require loginctl enable-linger in order to run Podman container without ssh session, in order to ensure, that all programs of ssh sessions are closed. -> The container is now run via a systemd user service and therefore loginctl enable-linger is not needed anymore.
[x] Create double-checking for every config step. -> Check description is present in config script.
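
The browser pool described in the done tasks (a fixed pool of browsers, where each browser is used by only one thread at a time, built on a thread safe queue) can be illustrated with a small sketch. This is not the project's Java implementation of HtmlClientSharer. It is a Python illustration of the same idea, the class and method names are hypothetical, and plain strings stand in for real browser instances.

```python
import queue
from contextlib import contextmanager


class ResourcePool:
    """Fixed pool of reusable resources, where each one is used by one thread at a time."""

    def __init__(self, resources):
        # queue.Queue is thread safe, so no additional locking is needed.
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)

    @contextmanager
    def borrow(self, timeout=None):
        """Hand out a free resource and take it back, when the with-block ends."""
        # Blocks until a resource is free; raises queue.Empty, if the timeout expires.
        resource = self._free.get(timeout=timeout)
        try:
            yield resource
        finally:
            # Always return the resource to the pool, even if the caller failed.
            self._free.put(resource)
```

A tester thread would wrap each test run in `with pool.borrow() as browser:`. As the resources are recycled instead of being created and closed per test run, this design avoids relying on closing a browser to free its processes.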

Messages