Service
- The server is publicly available at http://live.splitcells.net
- Update server.
- Restart the server, in order to ensure that every process uses the newest packages.
- Upgrade major version of OS when available.
- Update deployed software
- Improve deployment and its processes.
- Test security
- Test legalities and privacy policy.
Open Tasks
- [ ] Sometimes submitting an optimization does not work.
- [ ] Add a test for submitting optimization to the daily test.
- [ ] Create UI tester for text editor as well, in order to test both.
- [ ] Execute more tests at once, in order to have a better load test on production.
- [ ] Synchronize Playwright in the container created by `network.execute` and in Network Bom, in order to avoid some Playwright integration issues.
- [ ] Create test workers like htmlClient, but without a browser, because the browser tests currently seem to be unreliable.
The reason is that something goes wrong after a while in the Playwright integration.
There are always new problems and tests
- [ ] Document the goal of non GUI test workers.
- [ ] Consider HTML/Javascript client written purely in Java as well.
- [ ] For this the ProjectsRenderer needs to be a Dem Option.
- [x] Create ProjectsRendererOption.
- [ ] Initialize at least, when the Live Distro or its Dev is run.
- [ ] Fix the memory leak in the main Java service that gets it killed by the OS within 2 days.
See "Main service killed by OOM killer after 2 days."
- [o] Restart the application every Sunday once at 1 hour after midnight. -> It worked for some days. It seems to be better to let the program run as long as possible, in order to find some issues.
- [ ] Every program exit should cause a heap dump, for better maintenance.
- [x] Core dumps are created by default on JVM crashes. These should be enough.
Set `-XX:ErrorFile=` for the JVM, so core dumps are persisted and can be analyzed.
- [ ] Delete all core dumps older than 7 days, as these could contain private information.
Do this in the `worker.execute` command via a generated execution script when the `--class-for-execution` flag is used,
because `worker.execute` determines that core dumps are created and where they are stored.
- [ ] Document this in the arg doc of the network worker and note, that this is done in order to comply with EU's GDPR.
- [ ] Create script template.
- [ ] Generate script.
- [ ] Launch script in container instead of using Java entrypoint.
- [ ] Create an error reporter page that lists all errors without duplicates instead of the complete log.
- [ ] Install a TUI for Docker in order to debug the Forgejo runner, as it sometimes seems to start a queued workflow a bit late.
- [ ] Create and use a generic `worker.execute` command, in order to make things portable regarding the infrastructure.
- [ ] Deploy server software as systemd user service.
- [x] Create user service.
- [x] Make user service reachable via network.
- [x] Start user service on server start automatically.
- [ ] Build the image during the build command and execute the image during the execute command with `net.splitcells.network/bin/worker.execute`, instead of `net.splitcells.network.worker/bin/worker.execute`.
Currently, the build command builds the Java part and the execute command builds the container image.
- [x] Merge the `worker.execute.*` commands into one `worker.execute` command.
- [x] `worker.execute` is the command with file storage.
- [x] Use more descriptive names for `$1` and `$2`.
- [x] `worker.program` is the command without file storage.
- [x] `worker.service` is the command for executing a command in detached mode.
- [ ] Consider `worker.bootstrap.remote.at`.
- [ ] Add a parameter to `worker.execute` in order to build a project at the current folder in a standardized way.
- [ ] Consolidate `worker.repo.build`.
- [ ] Create a flag for the `worker.execute` command, in order to execute a program based on files created via `worker.build`.
- [ ] Create flag in order to execute program as a persistent service.
- [ ] Delete the obsolete `net.splitcells.network.worker` repo.
- [ ] Use this command for existing test deployment commands as well.
This tests whether this new command is portable or not.
- [ ] `deploy.build.at`
- [ ] `deploy.test.extensively.at`
- [ ] Build everything via `mvn clean install` at `net.splitcells.network.hub`.
- [ ] Restart the main service when the UI testers are not working anymore.
So far, no error explaining why the UI testers stop working could be found.
- [ ] Execute runtime profiling for long-running instances and store these,
in order to improve day to day performance via Grafana and Pyroscope.
- [ ] Automatically and continuously check, if the SSL certificate for HTTPS is still valid
and replace it automatically.
- [ ] Execute the complete test suite on live server periodically and commit result data to network log.
- [ ] Automatically restart server after update installation.
- [ ] Packages that `apt upgrade` would install are seemingly not installed by unattended-upgrades.
Installing these is required for Linux kernel updates.
- [ ] Make default file storage locations more sane regarding Linux home folder standard.
- [ ] Migrate files on live server accordingly.
- [ ] Manage upgrading major OS versions.
- [ ] The corresponding systemd service should only store logs up to 7 days.
- [ ] Make private setup script public, in order to have a basis for default setup script for a server.
- [ ] Run Forgejo Runner via Podman in order to not require root rights for Forgejo Runner: https://code.forgejo.org/forgejo/runner/src/branch/main/scripts/systemd.md
- [ ] Do not log already logged message, in order to simplify logs on live server.
- [ ] Do not output logs to standard output by default,
in order to have minimal OS logs.
- [ ] Consider creating double book-keeping for config files, in order to detect the ones that are not used.
Abort the software when such an unused file is found.
- [ ] Create backup of files.
- [ ] Do disaster recovery tests.
- [ ] Update certificates for ACME automatically without an explicit restart, in order to avoid these expiring during production.
- [ ] Create dedicated error log or error search query.
- [ ] Consider migrating from Podman to k8s.
- [ ] This would make it possible to run many other things on a unified infrastructure, like the Codeberg runner,
which currently more or less needs Docker (via https://code.forgejo.org/forgejo/runner/src/branch/main/examples/kubernetes).
- [ ] Consider using https://k9scli.io/ as cli TUI.
- [ ] Block outgoing connections.
- [ ] Make setup script for live server open source as well.
- [ ] Create 404 page for web server.
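
The open task of deleting core dumps older than 7 days could be sketched in Java as follows. This is only a minimal illustration under assumptions: the class `CoreDumpCleanup`, its method names and the folder passed on the command line are hypothetical and not part of the project, and the task list itself plans to do this via a generated execution script in `worker.execute` instead. The sketch treats every regular file directly inside the given folder as a dump file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of the project): removes dump files older than a maximum age. */
final class CoreDumpCleanup {
    private CoreDumpCleanup() {
    }

    /** Returns true if a file last modified at `lastModified` is older than `maxAge` at time `now`. */
    static boolean isExpired(Instant lastModified, Instant now, Duration maxAge) {
        return lastModified.isBefore(now.minus(maxAge));
    }

    /** Deletes all expired regular files directly inside `dumpFolder` and returns the deleted paths. */
    static List<Path> deleteExpired(Path dumpFolder, Duration maxAge, Instant now) throws IOException {
        // Collect the candidates first, so that the folder is not modified while it is being listed.
        final List<Path> candidates;
        try (Stream<Path> files = Files.list(dumpFolder)) {
            candidates = files.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        final List<Path> deleted = new ArrayList<>();
        for (Path file : candidates) {
            if (isExpired(Files.getLastModifiedTime(file).toInstant(), now, maxAge)) {
                Files.delete(file);
                deleted.add(file);
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1 || !Files.isDirectory(Path.of(args[0]))) {
            System.err.println("Usage: CoreDumpCleanup <existing-dump-folder>");
            return;
        }
        // 7 days is the retention period named in the task list.
        deleteExpired(Path.of(args[0]), Duration.ofDays(7), Instant.now())
                .forEach(file -> System.out.println("Deleted " + file));
    }
}
```

Passing `now` as a parameter keeps the age check testable without waiting for real time to pass. Such a helper would have to run periodically, because a single run at startup does not cover long-running instances.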
Done Tasks
- [ ] Fix JS errors in Gel's UI editor. -> JQuery was not unpacked by Maven, as there was a silent dependency resolution error.
- [x] Avoid deadlock in HTML client factory.
- [x] Playwright is not working anymore.
- [x] Install Playwright dependencies via Maven, so that the dependencies are more consistent. See "Playwright Notes".
- [x] Try using only one browser playwright instance at a time.
- [x] Use public domain for Playwright based tests, so that the certificate can be accepted by the browser.
- [o] Try fixing Playwright's potential race condition, while still maintaining multiple Playwright instances.
-> Using one Playwright instance, that is shared across multiple testers makes this work and is even more performant.
- [x] Use only one browser instance and one browser tab for each tester instead,
in order to avoid a process leak in Playwright.
Playwright does not seem to close all processes/threads after the browser and Playwright are closed in Java,
as many `Socket Process` and `Utility Process` processes with dedicated PIDs were found on the live server.
- [ ] There seems to be a race condition regarding the close method. ->
This may be caused by not closing the tabs of the HTML clients after the test.
Currently, the problem is not in the close method, but in the newPage method of Playwright instead.
Recycle browser tabs like the browser itself.
- [x] Playwright is not thread safe.
Use one dedicated instance per thread.
- [x] Simplify Playwright factory implementation by using a thread safe queue instead.
- [x] Provide thread safe queue implementation.
- [x] A solution is implemented and deployed, that provides one browser per live tester thread.
Check later, if this solution is working for more than 24 hours.
- [x] Avoid sharing document files in `worker.execute` by default.
- [x] Pull source code from Codeberg instead of GitHub.
- [x] Avoid logging to stdout and stderr, in order to have a clean systemd log.
- [x] Correct deployment via worker on live server. Currently, it is completely broken as the file paths have changed.
- [x] Automatic upgrade does not always work. There is sometimes a difference between unattended-upgrades
(with apt-daily and apt-daily-upgrade) and `apt update && apt upgrade --yes`.
- [o] Create own automatic restart service, if this gets too complicated. It has already cost too many hours.
Also keep in mind that the unattended-upgrades config is very complex, which is already an argument in itself for replacing it with a simple custom command.
This holds especially because the debug log is so bad: one does not see the concrete APT/dpkg actions in the log.
If this is done, document this reasoning.
-> The unattended-upgrades usage is fixed, so this is not needed for now.
- [x] Check if unattended-upgrades is working with some fixes.
- [ ] If this works, persist fixes in private git repo.
- [x] Try solving the problem via Origins-Pattern of `"origin=*";` and `"o=*";`.
- [x] Expand `/etc/apt/apt.conf.d/20auto-upgrades` with `APT::Periodic::Enable "1";`.
- [x] Expand `/etc/apt/apt.conf.d/50unattended-upgrades` with `Unattended-Upgrade::SyslogEnable "true";`, `Unattended-Upgrade::SyslogFacility "daemon";` and `Unattended-Upgrade::Verbose "true";`.
- [x] Expand `/etc/apt/apt.conf.d/50unattended-upgrades` with `Unattended-Upgrade::Debug "false";`.
- [x] Make unattended-upgrades work.
- [x] Do not require `loginctl enable-linger` in order to run the Podman container without an SSH session,
in order to ensure that all programs of SSH sessions are closed.
-> The container is now run via a systemd user service and therefore `loginctl enable-linger` is not needed anymore.
- [x] Create double-checking for every config step. -> Check description is present in config script.
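
The idea of simplifying the Playwright factory with a thread-safe queue, as noted in the done tasks above, can be illustrated with a generic pool. This is a sketch under assumptions and not the project's actual implementation: `ResourcePool` and its methods are hypothetical names, and the pooled type is left generic, so no Playwright API is used here. Whether a Playwright instance may be handed from one thread to another in this way is an assumption that would have to be checked against the Playwright documentation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical pool (not part of the project): each resource is used by at most one thread at a time. */
final class ResourcePool<T> {
    private final BlockingQueue<T> idle = new LinkedBlockingQueue<>();

    /** Creates `size` resources up front, for instance one browser per tester thread. */
    ResourcePool(int size, Supplier<T> factory) {
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows a resource, blocks while none is free, and always returns it to the queue afterwards. */
    <R> R withResource(Function<T, R> action) {
        T resource;
        try {
            resource = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a free resource.", e);
        }
        try {
            return action.apply(resource);
        } finally {
            // Returning the resource in `finally` avoids leaking it when the action throws.
            idle.add(resource);
        }
    }

    /** Number of resources that are currently not borrowed. */
    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        ResourcePool<StringBuilder> pool = new ResourcePool<>(2, StringBuilder::new);
        int length = pool.withResource(builder -> builder.append("test").length());
        System.out.println("Length: " + length + ", idle resources: " + pool.idleCount());
    }
}
```

The queue replaces explicit locking: `take` hands out each resource to exactly one caller, and recycling the same instances instead of creating new ones per test matches the note about avoiding process leaks.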
Playwright Notes
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>net.splitcells</groupId>
    <artifactId>worker.pom.empty</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>com.microsoft.playwright</groupId>
            <artifactId>playwright</artifactId>
            <version>1.45.0</version>
        </dependency>
    </dependencies>
</project>
mvn exec:java -e -D exec.mainClass=com.microsoft.playwright.CLI -D exec.args="install-deps"
Main service killed by OOM killer after 2 days.
Jan 01 15:40:56 net-splitcells-live systemd[821]: user.slice: A process of this unit has been killed by the OOM killer.
Jan 01 15:40:56 net-splitcells-live systemd[821]: libpod-480013e988f74f4b0a05947771663c4dc6b1e904f6984183568e8ceba925af67.scope: Consumed 1d 20h 33min 42.329s CPU time.
Jan 01 15:40:55 net-splitcells-live systemd[821]: libpod-480013e988f74f4b0a05947771663c4dc6b1e904f6984183568e8ceba925af67.scope: A process of this unit has been killed by the OOM killer.
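
Regarding the open task that every program exit should cause a heap dump, a minimal Java sketch could look like the following. The class `HeapDumpOnExit`, its method names and the target folder are assumptions for illustration only. It uses the JDK's `HotSpotDiagnosticMXBean`, which is only available on HotSpot-based JVMs. Note that a shutdown hook is not run when the process is killed via SIGKILL, which is what the OOM killer in the log above uses, so this sketch only covers regular exits and not that case.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of the project): writes a heap dump when the JVM exits regularly. */
final class HeapDumpOnExit {
    private HeapDumpOnExit() {
    }

    /** Writes a heap dump of live objects to `folder/name.hprof` and returns the path of the dump. */
    static Path dumpHeap(Path folder, String name) throws IOException {
        Files.createDirectories(folder);
        Path target = folder.resolve(name + ".hprof");
        // Remove a stale dump of the same name first, as the target file is expected not to exist yet.
        Files.deleteIfExists(target);
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(target.toAbsolutePath().toString(), true);
        return target;
    }

    /** Registers a shutdown hook that dumps the heap into `folder` on a regular JVM exit. */
    static void installShutdownHook(Path folder) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                dumpHeap(folder, "exit-" + System.currentTimeMillis());
            } catch (IOException | RuntimeException e) {
                // Failing to dump must not prevent the JVM from exiting.
                e.printStackTrace();
            }
        }));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: HeapDumpOnExit <dump-folder>");
            return;
        }
        installShutdownHook(Path.of(args[0]));
        System.out.println("Shutdown hook installed, exiting now.");
    }
}
```

Heap dumps contain the application's data, so the same retention concern as for core dumps applies to the files written here.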