- Issue number: #8
Task Description
The static server is hosted at Hetzner as well. It is the highly available primary access point to the project, as its tech stack is very simple.
The live server is hosted at Hetzner. The server settings of Hetzner can be accessed here. The network traffic is included in the fixed package price of Hetzner and no additional fees are charged, as long as no additional network expansions are bought for the package.
Service
- The server is publicly available at http://live.splitcells.net
- Update server.
- Upgrade major version of OS when available.
- Update deployed software.
- Improve deployment and its processes.
- Test security
- Test legalities and privacy policy.
- Check via browser that no cookies are actually set.
- Check via browser whether there are client-side errors.
- Check htop.
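The cookie check from the list above could be complemented by a small script. The following is only a minimal sketch in Python: the function names `set_cookie_headers` and `fetch_set_cookie_headers` are made up for illustration and are not part of the project. It only inspects the `Set-Cookie` headers of the final HTTP response, so cookies set via JavaScript or on intermediate redirects are not detected, and the manual browser check stays necessary.

```python
from urllib.request import urlopen


def set_cookie_headers(headers):
    """Return the values of all Set-Cookie headers from (name, value) pairs."""
    return [value for name, value in headers if name.lower() == "set-cookie"]


def fetch_set_cookie_headers(url, timeout=10):
    """Request the URL and return the Set-Cookie values of the final response."""
    with urlopen(url, timeout=timeout) as response:
        # response.headers keeps repeated headers as separate (name, value) pairs.
        return set_cookie_headers(response.headers.items())
```

Calling `fetch_set_cookie_headers("http://live.splitcells.net")` is expected to return an empty list, if the server really sets no cookies via HTTP headers.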
Open Tasks
- [ ] Check why so many program state folders are created. See `Program States` note.
  - [ ] Check one step at a time of the `deploy.remote` script.
- [ ] Speed up `user.bin.configure`, in order to speed up redeployment.
- [ ] Upgrade Debian.
- [ ] Create dedicated logging services.
- [x] Move from Dockerfile to Podman compose. -> Create dedicated docker compose for additional optional infrastructure.
- [x] Setup metrics server: https://prometheus.io/docs/prometheus/latest/installation/
- [x] Start Podman compose on server startup via a systemd user service.
- [ ] Get access to server via SSH port-forwarding scripts.
- [ ] Setup visualisation server based on Grafana:
- [ ] Adapt logging to Prometheus via vendor-agnostic OpenTelemetry: use Prometheus server, if it is found via config or convention and otherwise store logs to file as it already is.
- [ ] Create own telemetry log API.
- [ ] Forward own API calls to OpenTelemetry.
- [ ] Forward Telemetry to Log library, if Prometheus server is not reachable.
- [ ] Forward Telemetry to Log library, if logging via OpenTelemetry to the Prometheus server stops working at any time (e.g. caused by a connection issue).
- [ ] Log JVM metrics: https://prometheus.github.io/client_java/instrumentation/jvm/
- [ ] CPU usage
- [ ] Memory usage
- [ ] JVM Garbage Collector Metrics
- [ ] JVM Thread Metrics
- [x] Setup Java profiling:
- [x] https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/java/
- [x] https://grafana.com/docs/pyroscope/latest/configure-client/language-sdks/java/
- [x] Make Pyroscope agent optional in Java based execution via worker execute. -> Pyroscope is done via Java code only instead, as it is more flexible and minimizes the worker execute API.
- [o] Determine Pyroscope agent name via POM automatically, instead of hard coding the jar's name. -> The agent is not used and therefore the name does not have to be determined.
- [x] https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/ebpf/setup-docker/
- [x] Clean up `Dem#startPyroscope`. Is `Dem#startPyroscope` needed at all? -> `Dem#startPyroscope` was moved to PyroscopeService.
- [x] Clean up `setup.monitoring.sh`.
- [x] Configure host.docker.internal correctly.
- [x] Ensure that Prometheus data is persisted across restarts.
- [x] Use different port than 9090 as it conflicts with cockpit.
- [x] Re-enable cockpit service.
- [x] On the live server, Grafana cannot reach Prometheus, but can reach the Pyroscope service.
- [x] Create an easy to use Grafana connection command.
- [o] Redeploy Monitoring services semi-automatically. -> This is not needed for now.
- [x] Fix monitoring services versions and organize its updates. -> Using latest is good enough for now. When the first problem appears, we will fix the version, as implementing an update workflow right now has limited advantages.
- [x] Setup performance engineering service task.
- [ ] Consider creating a VPN for accessing the server instead of port-forwarding.
- [ ] Note, that this is done as this is generic functionality. It also allows one to do complex analysis and monitoring. Furthermore, the telemetry services are completely optional and the server will work, if these are down, after a restart, without a restart and without any additional config. In general, additional services like a database are ok, as long as these are optional.
- [ ] Send vert.x log to Prometheus as well.
- [ ] Note that some logs will still be saved locally in the future for error databases, where a Prometheus integration does not make sense.
- [ ] Check for better log viewers in bash as an alternative to a full-blown prometheus, as this would simplify the deployment. -> Java Profiling is important enough in order to set up this stack. Note this.
- [ ] Consider https://last9.io/blog/prometheus-with-docker-compose/ for advanced functionality.
- [ ] Host CPU/Memory Utilization page does not work. -> Delete these pages, when Prometheus and Grafana are set up.
- [ ] https://live.splitcells.net/net/splitcells/host/resource/cpu/utilization.csv.html
- [ ] https://live.splitcells.net/net/splitcells/host/resource/memory/utilization.csv.html
- [ ] Create error reporter page, that lists all errors without duplicates and not the complete log.
- [ ] Status of UI tests and tester
- [ ] Provide debug port for Java service over SSH based port forwarding.
- [ ] Save user credentials as salted hashes.
- [ ] If external ACME server is not available, but the certificate is still valid, that service should be able to start successfully and not crash at start.
- [ ] Declare a data protection officer.
- [ ] Make privacy policy of live and static server the same.
- [ ] Reset the git repos, in order to prevent an unexpected state.
- [ ] Log UI test runtime performance.
- [ ] Synchronize Playwright in Container created by `network.execute` and in Network Bom, in order to avoid some Playwright integration issues.
- [ ] Sometimes submitting an optimization does not work.
- [ ] This was caused by a bug in the LookupManager, when the persisted lookup got enabled.
- [ ] Add a test for submitting optimization to the daily Codeberg CI.
- [ ] Create test for lookup manager.
- [ ] Create an admin page, where all distinct errors can be viewed.
- [ ] Add command to delete 1 error from the view.
- [ ] Make logs smaller.
- [ ] Reset .m2 folder, in order to prevent an unexpected state.
- [ ] Create test workers like htmlClient, but without a browser, because currently the browser tests seem to be kind of unreliable.
The reason is that something goes wrong after a while in the Playwright integration.
There are always new problems and tests
- [ ] Document the goal of non GUI test workers.
- [ ] Consider HTML/Javascript client written purely in Java as well, in order to avoid the problems with Playwright.
- [ ] For this the ProjectsRenderer needs to be a Dem Option.
- [x] Create ProjectsRendererOption.
- [ ] Initialize at least, when the Live Distro or its Dev is run.
- [ ] Fix memory leak in main Java service, that gets it killed by the OS in 2 days. See `Main service killed by OOM killer after 2 days.`.
  - [o] Restart the application every Sunday once at 1 hour after midnight. -> It worked for some days. It seems to be better to let the program run as long as possible, in order to find some issues.
- [ ] Every program exit should cause a heap dump, for better maintenance.
  - [x] Core dumps are created by default on JVM crashes. These should be enough. Set `-XX:ErrorFile=` for the JVM, so core dumps are persisted and can be analyzed.
  - [ ] Delete all core dumps older than 7 days, as these could contain private information. Do this in the `worker.execute` command via a generated execution script, when the `--class-for-execution` flag is used, because `worker.execute` determines that the core dumps are created and determines their location.
    - [ ] Document this in the arg doc of the network worker and note, that this is done in order to comply with EU's GDPR.
    - [ ] Create script template.
    - [ ] Generate script.
    - [ ] Launch script in container instead of using Java entrypoint.
- [ ] Install a TUI for Docker in order to debug Forgejo runner, as it sometimes does seem to start a queued workflow a bit late.
- [ ] Restart main service, when the UI testers are not working anymore. Currently, an error for the UI testers not working could not be found.
- [ ] Execute runtime profiling for long-running instances and store these, in order to improve day to day performance via Grafana and Pyroscope.
- [ ] Automatically and continuously check, if the SSL certificate for HTTPS is still valid and replace it automatically.
- [ ] Execute the complete test suite on live server periodically and commit result data to network log.
- [ ] Automatically restart server after update installation.
- [ ] `apt upgrade` packages are seemingly not installed by unattended-upgrades. This is required for Linux kernel updates.
- [ ] Make default file storage locations more sane regarding Linux home folder standard.
- [ ] Migrate files on live server accordingly.
- [ ] Manage upgrading major OS versions.
- [ ] The corresponding systemd service should only store logs up to 7 days.
- [ ] Make private setup script public, in order to have a basis for default setup script for a server.
- [ ] Run Forgejo Runner via Podman in order to not require root rights for Forgejo Runner: https://code.forgejo.org/forgejo/runner/src/branch/main/scripts/systemd.md
- [ ] Do not log already logged message, in order to simplify logs on live server.
- [ ] Do not output logs to standard output by default, in order to have minimal OS logs.
- [ ] Consider creating double book-keeping for config files, in order to check ones, that are not used. Abort the software, when such an unused file is found.
- [ ] Create backup of files.
- [ ] Do disaster recovery tests.
- [ ] Update certificates for ACME automatically without an explicit restart, in order to avoid these expiring during production.
- [ ] Create dedicated error log or error search query.
- [ ] Consider migrating from Podman to k8s.
- [ ] In order to be able to run many other things with unified infrastructure like the Codeberg runner, that currently kind of needs docker (via https://code.forgejo.org/forgejo/runner/src/branch/main/examples/kubernetes).
- [ ] Consider using https://k9scli.io/ as cli TUI.
- [ ] Block outgoing connections.
- [ ] Make setup script for live server open source as well.
- [ ] Create 404 page for web server.
- [ ] Consider automatically sending a mail, when an error happens.
- [ ] Consider Nix for package management: Matthew Croughan - Use flake.nix, not Dockerfile - MCH2022
- [ ] Speed up deployment via parallel module builds with mvnd.
- [ ] Log public server availability via dedicated hardware.
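The open task about deleting core dumps older than 7 days could start from something like the following. This is only a minimal sketch in Python under stated assumptions: the function name `delete_old_dumps` and its parameters are hypothetical, the dump folder is passed in by the caller because its actual location is determined by `worker.execute`, and the age is judged by the file's modification time. It covers only the deletion step, not the generated execution script.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def delete_old_dumps(dump_dir, max_age_days=7, now=None):
    """Delete regular files in dump_dir older than max_age_days; return deleted paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    deleted = []
    for path in Path(dump_dir).iterdir():
        # Only regular files are considered; subfolders and symlinks are left untouched.
        if path.is_symlink() or not path.is_file():
            continue
        if path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return sorted(deleted)
```

Such a function could be run once before the Java program is started, so that old dumps are removed on every (re)start without a separate cron job.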
Done Tasks
- [x] Check whether Hetzner's network cost per month is limited or not and document this. Create a Hetzner document, for administration, notes and guidelines. -> Traffic is included in the fixed package price and no additional fees are charged, as long as no additional network expansions are bought for the package. This is noted in this issue under the task description.
- [o] Playwright based tests sometimes do nothing. -> Playwright tests work now.
- [x] Avoid XSL errors in systemd logs.
- [ ] Maybe there is also a problem, when the submitted problem is optimized, but not fully solved. -> No, Playwright is not working.
- [x] Minimize Systemd logs, so that Playwright errors can be found there. -> No Playwright errors are output to standard out and standard error anymore.
- [ ] Run program in remote debug mode.
  - [ ] Add new JVM parameter `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=127.0.0.1:8000`.
  - [ ] Forward port of container to host machine, but don't make the port publicly available.
  - [ ] Create command to forward the live server debug port to local machine over SSH.
- [ ] For each log message, also log its thread name.
- [ ] Try restarting Playwright instance daily.
- [x] Maybe an error in the test causes problems for Playwright. -> According to the logs, errors are recovered.
- [x] Maybe the problem is that optimization requests are being queued, but not processed yet. Thereby, the queue grows until something breaks. -> This does not seem to be the case.
- [x] Is Playwright present in container in the correct version? -> Yes
- [x] Create only one Playwright server, that provides one browser per HtmlClient in Java.
- [ ] If nothing works, use HTMLUnit instead.
- [x] Detect any deployment errors.
- [x] Maven Build
- [x] Shell Project Setup
- [x] worker.execute is affected as well. -> This is already the case.
- [x] Use only fast-forward git pulls for relevant workflows. -> This is already the case.
- [x] Test if pages with authorization do work without authentication. This should not be the case. -> Authorization does not work without authentication and therefore it is working correctly.
- [x] Because of Playwright TimeoutError:
According to HtmlClientSharer, closing the Firefox browser with Playwright does not always work.
So, starting and closing a real browser for each test run may not work with Playwright,
as this can cause a resource leak.
If this is the case, consider changing HtmlClientSharer so that there is a pool of real browser instances,
instead of just one (`HTMLClients#ROOT_CLIENT`). In other words, use one browser per CPU core, instead of one browser tab of a singleton browser per thread. Use this HtmlClientSharer, instead of starting and closing a browser for each test run.
  - [x] Implement this.
- [x] Test `podman run --security-opt seccomp=unconfined` as a fix for NodeJS start. -> `--security-opt seccomp=unconfined` does not fix the issue, because closing a Playwright instance does not terminate all of its processes. So, in the end processes are spawned until there are too many for the OS.
- [x] Note the reason, why a browser is only accessed by one thread at a time: https://github.com/microsoft/playwright-java/issues/1184
- [x] Only one browser at a time should be launched, as this also caused threading issues in the past.
- [x] Clean up the LiveDistro TODOs, if the UI tester works by now.
- [x] See chapter `process/resource limits reached`.
  - [x] Try `--security-opt seccomp=unconfined`. -> This worked.
  - [x] Document why `--security-opt seccomp=unconfined` is used.
  - [x] `--pids-limit=-1` seems to be the actual solution. Remove `--security-opt seccomp=unconfined` and deploy this to live server and check results.
    - [x] Deploy `--security-opt seccomp=unconfined` removal. -> The deployment is broken. -> The deployment is fixed.
    - [x] Check results.
- [o] Why are tabs or their context etc. being closed? `Target page, context or browser has been closed\n name='TargetClosedError\n stack='TargetClosedError: Target page, context or browser has been closed\n` -> This does not need to be noted as saving resources is the default use case for closing a resource.
- [x] Clean up HtmlClientsShare and HtmlClientsSharer.
- [x] Note the reason for the error message `[62.986s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 4k, detached.`.
- [x] Clean up HtmlClients.
- [x] Because of Playwright TimeoutError: Try UI tester, that starts a browser for each test instance and then destroy it,
but do not do actions over any browsers in parallel.
This is like the first UI HTML client draft, but with an exclusive lock for any action on any browser.
-> Closing Playwright instance does sometimes not work, which causes a resource leak.
The leak leads to the crash of the container, as there are too many processes.
Therefore, this does not work.
Furthermore, maybe all errors were caused by the fact, that the Geal editor test was not correctly written.
- [x] Analyse how the HTML client works now. -> It has a fixed pool of browsers, where only one thread can work on any one of them at a time. The HTML client has to kill its browser after the usage is done.
- [x] The Geal editor has to replace the code editor fully first.
- [x] Correct the Playwright locator usage and keep in mind,
that accessing a non-existing thing by locators causes a timeout exception.
- [x] One has to check the thing's presence first.
- [x] Handle timeout exceptions and add a better message to these, so its meaning is easier to understand.
- [x] Check if the error `Failed to create driver` at `com.microsoft.playwright.impl.driver.Driver.createAndInstall(Driver.java:105)` reappears. If that is the case, the reason for it must be found. A theory is, that the Playwright initial Java base setup does not work. For this the Linux journal log can be checked.
  - [x] Playwright cannot download the browser sometimes, because of a network error.
    - [x] Cache Playwright's browser downloads, by caching `~/.cache/ms-playwright/` via `worker.execute`.
- [x] Fix `Failed to create driver` at `com.microsoft.playwright.impl.driver.Driver.createAndInstall(Driver.java:105)`. -> Updating and redeploying the software fixed the issue.
- [x] Restart the server daily automatically.
- [x] Move automatic update to 3:00 to 3:45.
- [x] Setup daily restart configuration for 4:00.
- [x] Check if new configuration worked.
- [x] The deployed systemd service shuts down after a while. Maybe this is caused by the oneshot type or maybe this is caused by the daily server restart? -> This is caused by the fact, that the systemd service config has not configured an automated service start. -> The systemd user service was not enabled.
- [x] Create and use generic `worker.execute` command, in order to make things portable regarding the infrastructure.
  - [x] Deploy server software as systemd user service.
    - [x] Create user service.
    - [x] Make user service reachable via network.
    - [x] Start user service on server start automatically.
  - [x] Build image during build command and execute image during execute command with `net.splitcells.network/bin/worker.execute`, instead of `net.splitcells.network.worker/bin/worker.execute`. Currently, the build command builds the Java part and the execute command builds the container image.
    - [x] Merge `worker.execute.*` commands into one `worker.execute` command.
      - [x] `worker.execute` is command with file storage.
        - [x] Use more descriptive names for `$1` and `$2`.
      - [x] `worker.program` is command without file storage.
      - [x] `worker.service` is command to execute command in detached mode.
    - [o] Consider `worker.bootstrap.remote.at`.
    - [x] Add parameter to `worker.execute` in order to build a project at the current folder in a standardized way.
      - [o] Consolidate `worker.repo.build`.
    - [o] Create flag for `worker.execute` command, in order to execute program based on files created via `worker.build`.
    - [x] Create flag in order to execute program as a persistent service.
  - [x] Delete obsolete `net.splitcells.network.worker` repo.
  - [x] Use this command for existing test deployment commands as well.
This tests whether this new command is portable or not.
    - [o] `deploy.build.at` -> This command is deleted.
    - [o] `deploy.test.extensively.at` -> This command is deleted.
  - [x] Build everything via `mvn clean install` at `net.splitcells.network.hub`.
  - [x] Simplify `deploy.remote`.
- [x] Correct download logs command.
- [x] Support flat folder on Java side.
- [x] Set program name to
net.splitcells.martins.avots.distro. - [x] Execute more test at once, in order to have a better load test on production.
- [x] Create UI tester for text editor as well, in order to test both.
- [o] Browser tests are not always working. Log message: `Target page, context or browser has been closed` -> The warning log `Closing HTML clients is implemented, but is not actually expected to be used in production.` with its stack-trace was implemented and deployed in order to find the reason for this error. For now this task is closed, as this only appears sometimes. When the warning or log message is found again it will be attempted again.
- [x] Fix JS errors in Gel's UI editor. -> JQuery was not unpacked by Maven, as there was a silent dependency resolution error.
- [x] Avoid deadlock in HTML client factory.
- [x] Playwright is not working anymore.
- [x] Install Playwright dependencies via Maven, so that the dependencies are more consistent. See `Playwright Notes`.
- [x] Try using only one browser playwright instance at a time.
- [x] Use public domain for Playwright based tests, so that the certificate can be accepted by the browser.
- [o] Try fixing Playwright's potential race condition, while still maintaining multiple Playwright instances. -> Using one Playwright instance, that is shared across multiple testers makes this work and is even more performant.
- [x] Use only one browser instance and one browser tab for each tester instead,
in order to avoid process leak in Playwright.
Playwright does not seem to close all processes/threads after the browser and Playwright is closed in Java,
as many `Socket Process` and `Utility Process` processes with dedicated PIDs were found on live server.
- [ ] There seems to be a race condition regarding the close method. ->
This may be caused by not closing the tabs of the HTML clients after the test.
Currently, it is not the close method, but the newPage method of Playwright instead.
Recycle browser tabs like the browser itself.
- [x] Playwright is not thread safe.
Use one dedicated instance per thread.
- [x] Simplify Playwright factory implementation by using a thread safe queue instead.
- [x] Provide thread safe queue implementation.
- [x] A solution is implemented and deployed, that provides one browser per live tester thread. Check later, if this solution is working for more than 24 hours.
- [x] Avoid sharing document files in `worker.execute` by default.
- [x] Pull source code from Codeberg instead of GitHub.
- [x] Avoid logging to stdout and stderr, in order to have a clean systemd log.
- [x] Correct deployment via worker on live server. Currently, it is completely broken as the file paths have changed.
- [x] Automatic upgrade does not always work. There is sometimes a difference between unattended-upgrades (with apt-daily and apt-daily-upgrade) and `apt update && apt upgrade --yes`.
  - [o] Create own automatic restart service, if this gets too complicated. It already cost too many hours. Also keep in mind that the unattended-upgrades config is very complex and therefore already an argument in itself to replace it with a simple custom command. Especially, when the debug log is so bad, because one does not see the concrete APT/dpkg actions in the log. If this is done, document this reasoning. -> The unattended-upgrades usage is fixed, so this is not needed for now.
- [x] Check if unattended-upgrades is working with some fixes.
- [ ] If this works, persist fixes in private git repo.
- [x] Try solving the problem via Origins-Pattern of `"origin=*";` and `"o=*";`.
- [x] Expand `/etc/apt/apt.conf.d/20auto-upgrades` with `APT::Periodic::Enable "1";`.
- [x] Expand `/etc/apt/apt.conf.d/50unattended-upgrades` with `Unattended-Upgrade::SyslogEnable "true";`, `Unattended-Upgrade::SyslogFacility "daemon";` and `Unattended-Upgrade::Verbose "true";`.
- [x] Expand `/etc/apt/apt.conf.d/50unattended-upgrades` with `Unattended-Upgrade::Debug "false";`.
- [x] Make unattended-upgrades work.
- [x] Do not require `loginctl enable-linger` in order to run Podman container without ssh session, in order to ensure, that all programs of ssh sessions are closed. -> The container is now run via a systemd user service and therefore `loginctl enable-linger` is not needed anymore.
- [x] Create double-checking for every config step. -> Check description is present in config script.
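Several done tasks above describe a fixed pool of browsers, where only one thread works on any one browser at a time, implemented via a thread safe queue. The idea can be sketched roughly as follows. This is an illustration in Python and not the project's Java code: `ResourcePool` and `borrow` are hypothetical names, and plain strings stand in for real browser instances.

```python
import queue
import threading
from contextlib import contextmanager


class ResourcePool:
    """Fixed pool of resources, each used by at most one thread at a time."""

    def __init__(self, resources):
        # queue.Queue is thread safe, so no additional locking is needed here.
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)

    def free_count(self):
        """Return the approximate number of resources currently not borrowed."""
        return self._free.qsize()

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a resource is free; raises queue.Empty after the timeout.
        resource = self._free.get(timeout=timeout)
        try:
            yield resource
        finally:
            # Always hand the resource back, even if the caller's code failed.
            self._free.put(resource)
```

A tester thread would wrap every browser action in `with pool.borrow() as browser:`. Because a resource is removed from the queue while it is borrowed, no second thread can get the same browser, and the pool size bounds the number of browser processes.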