Distracted Engineer


A view on stuff from an easily distracted engineer

Recovering from the Void: A Forensic Journey into n8n and SQLite Corruption


The Crisis: The “Upgrade” that Vanished

It started with a routine minor upgrade of my n8n instance, prompted by a few articles mentioning some severe security issues. It was late, and since I’d upgraded n8n heaps of times before and this was a minor upgrade, I didn’t even check if it was working as expected beyond the container starting.

Everything seemed fine until I logged in the next morning to investigate why some expected triggers hadn’t gone off. To my horror, my workflows had reverted to a state from about 5 days ago. I had lost almost a week of work tuning some automations I was using for sending messages from my Frigate NVR, which I am using for my home security system.

In this post, I’ll walk you through the technical details of recovering data from a corrupted, 14GB SQLite database, the forensic tools I used, the mistakes I made, and how I eventually “surgically repaired” my way back to a working instance. I learned a lot, and I figure if that helps someone else one day, it’s worth writing up.

Note: I used ChatGPT and Gemini (once I ran out of messages on ChatGPT) to help me work through all this. While I’m confident I could have got there eventually the old-fashioned way, it sure felt easier using these tools.


1. The Mystery of the Missing Days

I initially attempted a restore from a Restic snapshot taken right before the upgrade. However, the Docker container refused to stay up, spitting out this error:

n8n-1  | There was an error running database migrations
n8n-1  | SQLITE_IOERR: disk I/O error
n8n-1 exited with code 1

What likely happened during the upgrade:

In a 14GB SQLite database on a busy n8n instance, the database file and its rollback journal are in a constant state of flux. When I performed the upgrade, a few things probably converged to create the “5-day gap”:

  1. Delayed Checkpointing/Merging: With a database that size, SQLite can struggle to merge journal data back into the main .sqlite file. It’s possible that days’ worth of changes were durable in the journal’s eyes but hadn’t been fully baked into the main file.
  2. The “Migration Trap”: When n8n upgrades, it runs database migrations. If a migration is interrupted or crashes due to the sheer size of the tables or memory exhaustion, it can roll back the database to the last consistent state. In my case, that “last consistent state” was unexpectedly old.
  3. Torn Writes and Volume Locking: Docker volume mounts can sometimes mismanage how SQLite expects locks to work. The upgrade process likely triggered a lock contention that prevented the latest transactions from being written to the physical file before the container restarted.

By the time I checked the next morning, the system was essentially looking at a “snapshot” of the database from several days prior, with the most recent changes trapped in a journal file that the new version of n8n couldn’t reconcile.

Technical Root Cause:

My database was massive (14.6 GB). SQLite’s default journaling mode uses a rollback journal (the database.sqlite-journal file). Because the backup was taken while n8n was running (a “hot backup”), the snapshot captured the database in the middle of a transaction.

When I restored it, SQLite saw the journal and tried to “undo” or “redo” that transaction. Because of the sheer size of the DB and likely filesystem locking issues within Docker, it failed, leading to the SQLITE_IOERR.
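
Before touching anything, it’s worth checking what state the file is actually in. A minimal sketch (run it against a copy, because merely opening a database that has a hot journal makes SQLite try to roll it back):

# Work on a copy - opening the real file would trigger journal recovery
cp database.sqlite check.sqlite

sqlite3 check.sqlite "PRAGMA journal_mode;"     # 'delete' = rollback journal (the default), 'wal' = write-ahead log
sqlite3 check.sqlite "PRAGMA integrity_check;"  # 'ok' means the main file is structurally sound
ls -la database.sqlite*                         # a lingering -journal or -wal file means unfinished transactions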

Why was my database so massive?

Well, I had been working on automations that send notifications via Telegram when my Frigate instance detected people at the front door, etc. What I didn’t consider is that 1) these events fire very often and 2) n8n was retaining the binary image/video data of these executions. I eventually cleaned those executions up, but I hadn’t noticed that my testing had already caused the n8n database to balloon.


2. Mistake #1: Removing (and then Replaying) the Journals

In a naive attempt to get the DB to boot, I first moved the journal files (database.sqlite-journal and crash.journal) out of the way.

The Result: n8n started! The Problem: The workflows were still 5 days old.

The Theory of the “Ghost Journal”

I realized that by deleting the journals, I had essentially told SQLite to ignore all the changes waiting “on the doorstep” and to only look at the building as it stood 5 days ago. I thought I could outsmart SQLite by restoring from backup again and forcing it to “replay” those journals manually.

I tried to open the database using the SQLite CLI, hoping it would see the journal and apply the changes.
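
Roughly, the attempt looked like this (paths are simplified placeholders):

# Restore the hot backup again, this time keeping the journal next to the DB
cp /restore/database.sqlite .
cp /restore/database.sqlite-journal .

# Simply opening the database makes SQLite process the journal;
# any statement forces it to finish before returning
sqlite3 database.sqlite "PRAGMA integrity_check;"
ls database.sqlite-journal   # the journal file disappears once SQLite has processed it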

What actually happened: The replay “succeeded” in a technical sense—the journal file disappeared—but when I checked the timestamps on the workflows, they were still 5 days old.

The Insight:

When a journal replay fails to show “new” data, it’s usually because of one of two brutal realities:

  1. The Journal was a Rollback, not a Roll-forward: In “Rollback Journal” mode (the default), the journal contains the old data to be restored if a transaction fails. If the database was captured mid-migration, replaying the journal actually ensures the database stays in its older, pre-migration state.
  2. Inconsistency: Because the “hot backup” was taken while the file was being written, the journal and the main DB file were subtly out of sync. SQLite looked at them, decided they didn’t match perfectly, and chose to preserve the database’s “safe” state—the 5-day-old version—rather than risking corruption by replaying a mismatched journal.

At this point, I realized that replaying the journal was effectively a dead end. To get the data back, I had to stop acting like a database administrator and start acting like a digital archaeologist.


3. The Forensic Breakthrough: strings and .recover

I knew the data was there. I had a feeling that if I could just see if the “missing” text existed in the file, I’d know if I was chasing a ghost.

I was using Gemini to help brainstorm, and it gave me a simple but brilliant idea: use the Linux strings utility to search the binary file for the names of the workflows I knew I had edited in the last 48 hours.

I ran a quick check on the raw binary of the 14GB file:

strings database.sqlite | grep "Frigate - Reviews (Stitcher)"

It returned results!

Seeing that specific string—a workflow name I had only created two days prior—was the “smoking gun.” It meant the JSON text was physically on the disk in the “unallocated space” or trapped in pages the index had ignored. It wasn’t “gone”; it was just “unindexed.”
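
If you want a bit more confidence than a bare match, counting hits and peeking at the surrounding bytes is cheap:

# how many times does the workflow name appear in the raw file?
strings database.sqlite | grep -c "Frigate - Reviews (Stitcher)"

# show some context around each hit to confirm it's workflow JSON, not a log line
grep -a -o -E '.{0,60}Frigate - Reviews \(Stitcher\).{0,60}' database.sqlite | head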

The Tool: sqlite3 .recover

Standard SQL queries follow the index. If the index is corrupt or rolled back, the data is “invisible.” The .recover command (added in SQLite 3.29.0) ignores the index and scans every byte of the file, “carving” out valid-looking rows.

The Command:

# Using Alpine because it ships with a modern, feature-complete SQLite
docker run --rm -v "$(pwd):/workspace" -w /workspace alpine sh -c "
  apk add --no-cache sqlite && \
  sqlite3 database.sqlite '.recover' | sqlite3 recovered_forensic.sqlite"

Crucial Insight: I had to restore the backup again and run this without letting the journal replay. Replaying the journal would have “officially” deleted the forensic data I was trying to find by overwriting those “free” pages with the 5-day-old state. By deleting the journal and running .recover on the “dirty” database file, I was forcing the tool to scrape the “quiet” parts of the file where the newer data was still sitting, without letting the journal “clean it up.”
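
End to end, the sequence that produced a useful recovered file looked roughly like this (the restic snapshot ID and paths are placeholders for my own):

# 1. Restore the pre-upgrade snapshot into a scratch directory
restic restore a1b2c3d4 --target ./scratch
cd ./scratch/path/to/n8n-data

# 2. Deliberately stop the rollback from ever running
rm -f database.sqlite-journal crash.journal

# 3. Carve every readable row out of the dirty file into a fresh database
docker run --rm -v "$(pwd):/workspace" -w /workspace alpine sh -c "
  apk add --no-cache sqlite && \
  sqlite3 database.sqlite '.recover' | sqlite3 recovered_forensic.sqlite"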


4. The 14GB Elephant: Cleaning the Bloat

The recovery worked, but I now had a 14.9 GB recovered file. I knew that my workflows certainly didn’t add up to 14 GB, which meant the file was mostly “bloat”—likely old execution history and logs.

Investigating the Bloat

To confirm exactly where the space was going, I used a SQLite query to check the size of each table. Since the file was too large to open easily in a GUI, I ran this via Docker:

docker run --rm -v "$(pwd):/workspace" -w /workspace alpine sh -c "
  apk add --no-cache sqlite && \
  sqlite3 recovered_forensic.sqlite '
    SELECT 
      name, 
      SUM(pgsize)/1024/1024 AS size_mb 
    FROM dbstat 
    GROUP BY name 
    ORDER BY size_mb DESC 
    LIMIT 10;'
"

The results were eye-opening:

  • execution_data: 14,212 MB
  • workflow_entity: 0 MB (Relative to the GBs of logs, my actual work was a rounding error)

It turned out that execution_data—the table that stores the massive JSON blobs and binary metadata from every node in every past run—was responsible for 99.9% of the database size.

The Surgical Clean

To make the database safe for n8n to use again, I had to perform a “surgical strike” to remove this bloat. If I hadn’t, the first thing n8n would do on startup is try to run migrations on these 14GB tables, which would almost certainly crash the process again.

DELETE FROM execution_data;
DELETE FROM execution_entity;
VACUUM;

The VACUUM command is the most important part here—it tells SQLite to actually rebuild the file and release the empty space back to the OS. After this, the DB shrank from 14.9GB to roughly 20MB.
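
I ran the clean-up with the same Alpine-and-sqlite pattern as before, something along these lines:

docker run --rm -v "$(pwd):/workspace" -w /workspace alpine sh -c "
  apk add --no-cache sqlite && \
  sqlite3 recovered_forensic.sqlite '
    DELETE FROM execution_data;
    DELETE FROM execution_entity;
    VACUUM;'"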


5. Database Surgery: Solving the “Broken JSON” Crash

Even with the data recovered and the “bloat” removed, n8n still refused to boot. It crashed with a familiar but frustrating error: SyntaxError: Unexpected end of JSON input

Forensic recovery isn’t perfect. It scans for valid-looking rows, but sometimes it recovers “torn writes”—scraps of data where the beginning of a JSON string was found, but the end had been overwritten.

Identifying the Radioactive Rows

I used SQL to find exactly which records were invalid JSON. I looked at both the active workflows and the history table:

SELECT id, name, createdAt, updatedAt, active, versionId, json_valid(nodes), json_valid(connections) FROM workflow_entity ORDER BY updatedAt DESC;
SELECT versionId, workflowId, createdAt, updatedAt, json_valid(nodes), json_valid(connections) FROM workflow_history;

This highlighted two major issues:

  1. The Broken Active Workflow: Agent Tools Backend (ID: pcgNxetwshnFEyB5) was corrupted. Because this workflow was marked as “Active,” n8n tried to parse and instantiate it on boot, causing the global crash.
  2. The Corrupted History: A specific record for Frigate - Reviews (Stitcher) in the workflow_history table was also mangled, which would have crashed the UI later if n8n had managed to start.

The “Fixes.sql” Strategy

I didn’t want to play whack-a-mole with a live database. I adopted an iterative repair strategy:

  1. Restore the “Recovered & Vacuumed” DB from a safety copy.
  2. Run a fixes.sql script.
  3. Try to boot n8n.
  4. If it crashed, analyze the error, add a new fix to fixes.sql, and go back to step 1.

This meant I ended up with a single, reproducible script that fixed everything at once.
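
In practice the loop was just a handful of shell commands run over and over; a rough sketch (file and service names are placeholders for my setup):

# keep a pristine copy of the recovered + vacuumed DB
cp recovered_forensic.sqlite recovered_clean.sqlite

# each iteration:
cp recovered_clean.sqlite database.sqlite    # 1. reset to the known-good copy
sqlite3 database.sqlite < fixes.sql          # 2. apply every fix found so far (or use the dockerised sqlite3 from earlier)
docker compose up n8n                        # 3. boot and watch the logs
# if it crashes: add another statement to fixes.sql and start again from step 1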

Repairing the “Agent Tools Backend”

Since the main record was toast, I used the workflow_history table as an “organ donor.” I pulled the last known-good nodes and connections into the main entity table:

-- Fix Agent Tools Backend (pcgNxetwshnFEyB5) from its own history
UPDATE workflow_entity 
SET nodes = (SELECT nodes FROM workflow_history WHERE workflowId = 'pcgNxetwshnFEyB5' ORDER BY createdAt DESC LIMIT 1),
    connections = (SELECT connections FROM workflow_history WHERE workflowId = 'pcgNxetwshnFEyB5' ORDER BY createdAt DESC LIMIT 1),
    staticData = NULL, -- Clear potentially corrupted state
    pinData = '{}',
    versionId = '6b6369ae-e7c4-46b4-b0af-e07c3b08f612', 
    activeVersionId = '6b6369ae-e7c4-46b4-b0af-e07c3b08f612' 
WHERE id = 'pcgNxetwshnFEyB5';

I also had to “borrow” some metadata (like meta and settings) from another healthy workflow to ensure the row was structurally complete.
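
That donor query followed the same pattern as the rest of fixes.sql; a sketch, with the donor workflow ID as a placeholder:

sqlite3 database.sqlite <<'SQL'
-- Borrow structural metadata from a known-healthy workflow (donor ID is a placeholder)
UPDATE workflow_entity
SET meta     = (SELECT meta     FROM workflow_entity WHERE id = 'HealthyWorkflowId123'),
    settings = (SELECT settings FROM workflow_entity WHERE id = 'HealthyWorkflowId123')
WHERE id = 'pcgNxetwshnFEyB5';
SQL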

Cleaning up the History

For the Frigate workflow, the history record itself was a problem. I repaired it by reversing the process—copying the valid nodes from the main entity back into the history to stop the background workers from choking:

UPDATE workflow_history 
SET nodes = (SELECT nodes FROM workflow_entity WHERE id = 'QUx4a1z0HKn0bHPN') 
WHERE workflowId = 'QUx4a1z0HKn0bHPN';

After applying these repairs, n8n finally booted to the login screen.


6. Lessons Learned & Making it Robust

I survived this, but I never want to do it again. Here is the new architecture I’ve implemented:

A. Aggressive Pruning

I updated n8n environment variables to prevent the DB from ever hitting 14GB again. If you’re doing anything with images or high-frequency triggers, this is mandatory:

  • EXECUTIONS_DATA_MAX_AGE=48 (2 days)
  • EXECUTIONS_DATA_PRUNE=true
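
In a plain docker run invocation, those just become two more -e flags. This is a sketch; the image name and volume path follow the standard n8n Docker setup, so adjust them to your own:

docker run -d --name n8n \
  -e EXECUTIONS_DATA_PRUNE=true \
  -e EXECUTIONS_DATA_MAX_AGE=48 \
  -v n8n_data:/home/node/.n8n \
  n8nio/n8n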

B. Decoupled Backups (The “Gold Standard”)

I no longer rely solely on restic backups of the host (i.e. of the SQLite file). I now run an hourly cron job that exports workflows and credentials to a flat-file directory using the n8n CLI:

n8n export:workflow --backup --output=/backups/workflows
n8n export:credentials --backup --output=/backups/credentials
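
Since n8n itself lives in a container, the cron job wraps these in docker exec, something like this (the container name is a placeholder from my compose project, and /backups is a bind mount shared with the host):

docker exec n8n-n8n-1 n8n export:workflow    --backup --output=/backups/workflows
docker exec n8n-n8n-1 n8n export:credentials --backup --output=/backups/credentials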

C. Git-Based Versioning

I turned the /backups folder into a Git repository. Now, every hour, the server automatically commits the JSON files. I have a full git diff history of every change I make to every workflow. If the DB dies, I just import the JSON back with the n8n CLI. This is essentially a “Time Machine” for my automations.
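
The commit side is a tiny script driven by cron; roughly (the script path and /backups location are placeholders):

#!/bin/sh
# hourly: commit whatever the n8n CLI export just wrote into /backups
cd /backups
git add -A
# only create a commit when something actually changed
git diff --cached --quiet || git commit -m "n8n backup $(date -u +%Y-%m-%dT%H:%MZ)"

# crontab entry
0 * * * * /usr/local/bin/n8n-git-backup.sh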


Conclusion

If you are running n8n on SQLite:

  1. Monitor your DB size. Over 1GB is a warning; over 10GB is a disaster waiting to happen.
  2. Don’t trust “hot” backups of SQLite files—always quiesce or stop the container if you can.
  3. Use the CLI export tool. It is your only true safety net.
  4. Use AI tools. Used right, they make working through these complex issues a lot easier.

Forensic recovery is possible, but it’s much easier to just git checkout your way out of a problem.


A Simple SSH Proxy for Life Between Home and Work: go-home-or-away

Moving a laptop between home and work networks can be a pain, especially when it comes to keeping SSH sessions alive and seamless. I recently ran into this problem with VSCode’s remote SSH feature: at home, I could connect directly to my servers, but at work, I needed to go through a proxy. My old solution was to maintain separate .ssh/config entries for each location, using ProxyJump for the remote connection. This meant restarting VSCode sessions every time I switched networks—a hassle I wanted to avoid.

The Solution: go-home-or-away

To make my life easier, I built a small tool called go-home-or-away. It’s a conditional proxy for SSH, written in Go, designed to be used as a ProxyCommand in your SSH config. The idea is simple: if you can reach your target host directly, it connects straight through. If not, it automatically proxies your connection via a specified SSH server.

This is perfect for anyone who moves between networks with different access rules—like working from home, the office, or a coffee shop—without wanting to constantly tweak SSH configs or restart sessions.

How It Works

The logic is straightforward:

  • When you’re “home” (i.e., direct TCP connection to the target host/port works), it connects directly.
  • When you’re “away” (direct connection fails), it falls back to proxying through your chosen SSH server using ssh -W.
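
To make that concrete, here is a rough shell equivalent of the same decision, usable as a ProxyCommand. This is a sketch of the idea, not the tool’s actual code, and the jump host name is a placeholder:

#!/bin/sh
# usage from ~/.ssh/config:  ProxyCommand /usr/local/bin/home-or-away.sh %h %p
HOST="$1"; PORT="$2"; JUMP="jumphost.example.com"

if nc -z -w 2 "$HOST" "$PORT" 2>/dev/null; then
    exec nc "$HOST" "$PORT"              # "home": direct connection works, just pipe it through
else
    exec ssh -W "$HOST:$PORT" "$JUMP"    # "away": tunnel stdio via the jump host
fi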

Setup is simple and works on both Windows and Linux. Just drop the binary somewhere in your path and update your SSH config to use it as a ProxyCommand. The README in the repo has copy-pasteable examples for both platforms.

Why Go?

I chose Go for this project because it lets you build a single, static binary—no need for messy scripts or extra dependencies. The implementation was pretty simple, and as a bonus, it gave me a chance to use Cline with Gemini 2.5 Pro to generate most of the code. (The code isn’t complex, but it was nice to automate the boring parts.)

Setup & Usage

The setup I use is almost exactly what’s in the README. On Linux, I put the binary in /usr/local/bin and add a ProxyCommand line to my ~/.ssh/config. On Windows, it’s the same idea—just point to the .exe in your config. I haven’t run into any issues so far; it “just worked” out of the box.

Who’s This For?

If you, like me, move between networks and want your SSH sessions (and VSCode remote connections) to “just work” no matter where you are, this tool is for you. No more juggling multiple config entries or restarting sessions.

Feedback & Future

I don’t have any big plans for new features, and I’m not looking for contributors right now, but feedback is always welcome.

If you’re interested in trying it out, the code and binaries are available here: github.com/cfullelove/go-home-or-away

Conclusion

Sometimes the best tools are the ones that quietly solve a real problem without much fuss. go-home-or-away has made my daily workflow a little smoother, and maybe it’ll help someone else too.


Troubleshooting Proxmox LXC Container Backup Errors

Today, I encountered a perplexing issue while attempting to back up two LXC containers in Proxmox. These containers threw errors during the backup process, and after a bit of trial and error, I managed to resolve the issue. I hope that by sharing my experience, others may avoid the same pitfalls.

The Problem: Backup Errors and Permission Denied Messages

My journey began when I noticed repeated failures in my Proxmox backup logs for container 206. The error message was as follows:

INFO: Starting Backup of VM 206 (lxc)
INFO: Backup started at 2024-12-22 18:00:20
INFO: tar: ./home/ansible/.ansible: Cannot open: Permission denied
INFO: tar: ./home/ansible/.ssh: Cannot open: Permission denied
ERROR: Backup of VM 206 failed - command [...] failed: exit code 2

The logs indicated that files within the container’s home directory could not be opened due to permission issues. Despite my best efforts searching through articles, none provided the clarity I needed. Fortunately, a Proxmox forum post pointed me in the right direction.

The Root Cause: Altered User ID Mapping

Upon further investigation, I discovered that the issue stemmed from altered user ID mappings after the containers’ creation. Essentially, I had adjusted the user mappings in my LXC containers so that user 1000 inside the container mapped straight to user 1000 on the Proxmox host (instead of the default 101000). This change wasn’t initially problematic, but it inadvertently led to ownership conflicts, because files created before the mapping change retained their original host-side ownership.

For container 206, the user mapping configuration in /etc/pve/lxc/206.conf looked like this:

lxc.idmap: u 0 100000 1000
lxc.idmap: g 0 100000 1000
lxc.idmap: u 1000 1000 1
lxc.idmap: g 1000 1000 1
lxc.idmap: u 1001 101001 64535
lxc.idmap: g 1001 101001 64535

The Initial State of Directory Permissions

The misalignment between the expected and actual ownership meant that the files had the wrong UID, preventing the backup tool from accessing them. Before fixing, the files in the container’s home directory had the following permissions:

root@real4:~# ls -la /tank/subvol-206-disk-0/home/ansible
total 31
drwxr-xr-x 4 101000 101000    7 May  5  2024 .
drwxr-xr-x 3 100000 100000    3 May  5  2024 ..
drwx------ 3 101000 101000    3 May  5  2024 .ansible
-rw-r--r-- 1 101000 101000  220 Mar 28  2022 .bash_logout
-rw-r--r-- 1 101000 101000 3526 Mar 28  2022 .bashrc
-rw-r--r-- 1 101000 101000  807 Mar 28  2022 .profile
drwx------ 2 101000 101000    3 May  5  2024 .ssh

The UIDs were 101000 instead of 1000, which caused the permission issues during the backup process.

The Solution: Correcting File Ownership

To resolve the permission issues, I had to correct the ownership of the files in question. Here’s what I did:

# chown -R 1000:1000 /tank/subvol-206-disk-0/home/ansible

This command recursively changed the ownership of the files in the directory to the correct UID and GID (1000:1000). Post-adjustment, my containers backed up successfully, and the permissions were as they should be:

root@real4:~# ls -lan /tank/subvol-206-disk-0/home/ansible
total 31
drwxr-xr-x 4   1000   1000    7 May  5  2024 .
drwxr-xr-x 3 100000 100000    3 May  5  2024 ..
drwx------ 3   1000   1000    3 May  5  2024 .ansible
-rw-r--r-- 1   1000   1000  220 Mar 28  2022 .bash_logout
-rw-r--r-- 1   1000   1000 3526 Mar 28  2022 .bashrc
-rw-r--r-- 1   1000   1000  807 Mar 28  2022 .profile
drwx------ 2   1000   1000    3 May  5  2024 .ssh
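
If you have remapped more than one container, it is worth sweeping the subvolume for anything still owned by the old high-range IDs before the next backup runs; a quick check along these lines:

# list anything still owned by the pre-remap UID/GID (101000 in my case)
find /tank/subvol-206-disk-0 -uid 101000 -o -gid 101000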

Conclusion

If you encounter backup errors in Proxmox with similar permission denied messages, consider checking your user ID mappings, particularly if these have been altered post-container creation. Correcting file ownership to match the adjusted UID mappings can save the day. I hope this writeup helps someone else dealing with a similar conundrum. Have questions or tips to share? Feel free to leave a comment below!


Creating calendar events on the nth business day each month

Problem

We have monthly reports that are due each month, and I wanted to create an automated reminder in Slack when they were due, using Google Calendar and Zapier.

Unfortunately, the reports are due in a pattern that Google Calendar repeating events doesn’t support:

  • The 10th business day of the month. E.g. in February 2022 the 10th business day is Monday 14th
  • The first business day that is on or after the 10th day of the month. E.g. 10th October 2021 is a Sunday, so the report would be due the next day on Monday 11th October

Solution

Luckily, while Google Calendar doesn’t let you create these types of events in the web app, it does let you add a custom event that has been crafted somewhere else.

I found a great article that explained the process well and how these custom events needed to work. Using this, I was able to create the calendar events for the two scenarios I was after.

10th Business Day of the month

BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
RRULE:FREQ=MONTHLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR;BYSETPOS=10
SUMMARY:Brisbane Report Due
LOCATION:msa-reporting
DTSTART;VALUE=DATE:20210701T090000
SEQUENCE:0
DESCRIPTION:@here The Brisbane report is due today! 
END:VEVENT
END:VCALENDAR

The RRULE is what defines the recurrence behaviour. Breaking that line down to see how it works:

  • FREQ=MONTHLY - repeat monthly
  • INTERVAL=1 - repeat each month
  • BYDAY=MO,TU,WE,TH,FR - only on weekdays
  • BYSETPOS=10 - the 10th day that matches the above rule

First Business Day on or after the 10th day of the month

BEGIN:VCALENDAR
VERSION:2.0
BEGIN:VEVENT
RRULE:FREQ=MONTHLY;INTERVAL=1;BYDAY=MO,TU,WE,TH,FR;BYMONTHDAY=10,11,12,13;BYSETPOS=1
SUMMARY:Melbourne Report Due
LOCATION:msa-reporting
DTSTART;VALUE=DATE:20210701T090000
SEQUENCE:0
DESCRIPTION:@here The Melbourne report is due today! 
END:VEVENT
END:VCALENDAR

This one is a bit more complicated. Breaking that line down to see how it works:

  • FREQ=MONTHLY - repeat monthly
  • INTERVAL=1 - repeat each month
  • BYDAY=MO,TU,WE,TH,FR - only on weekdays
  • BYMONTHDAY=10,11,12,13 - AND only on the 10th - 13th day of the month (handling when the 10th lands on a weekend)
  • BYSETPOS=1 - the first day that matches the previous two constraints (weekday && 10th-13th day of the month)

Constraints

Unfortunately, neither of these rules accounts for weekdays that aren’t business days, such as public holidays. For example, in January 2022 the custom event for the 10th business day will show Friday the 14th; however, Monday the 3rd is a public holiday, so the 10th business day is in fact Monday the 17th.



Handling syslog clients sending blank fields

Problem

I recently started centrally collecting logs on my home network using Promtail, Loki, and Grafana. This lets you easily search and analyse logs for all sorts of things.

One of the services I wanted to collect logs for was my TP-Link Wireless Access Points managed by TP-Link Omada. However, it turns out that the APs weren’t setting the app-name field correctly when sending logs to the remote server.

This resulted in errors in promtail that frequently reset the connection with rsyslogd, which was acting as a syslog relay.

This was the error I could see in the logs:

promtail:

level=warn ts=2022-01-04T04:08:21.013648461Z caller=syslogtarget.go:216 msg="error parsing syslog stream" err="expecting an app-name (from 1 to max 48 US-ASCII characters) or a nil value [col 50]

rsyslog:

rsyslogd: omfwd: TCPSendBuf error -2027, destruct TCP Connection to promtail:514 [v8.36.0 try http://www.rsyslog.com/e/2027 ]
rsyslogd-2027: omfwd: TCPSendBuf error -2027, destruct TCP Connection to promtail:514 [v8.36.0 try http://www.rsyslog.com/e/2027 ]
rsyslogd-2007: action 'action 2' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.36.0 try http://www.rsyslog.com/e/2007 ]
rsyslogd-2359: action 'action 2' resumed (module 'builtin:omfwd') [v8.36.0 try http://www.rsyslog.com/e/2359 ]

Environment

Syslog Clients:

  • Omada Controller 4.4.8
  • TP-Link EAP 620 HD (x2)

rsyslogd configured as a forwarder to promtail. docker-compose extract below:

version: '3'

services:
#    -- snip --
  loki:
    image: grafana/loki:2.4.1
    restart: unless-stopped
    volumes:
      - ./loki-config.yaml:/mnt/config/loki-config.yaml
    command:
      - --config.file=/mnt/config/loki-config.yaml
  promtail:
    image: grafana/promtail:2.4.1
    restart: unless-stopped
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yaml:/mnt/config/promtail-config.yaml
    command:
      - --config.file=/mnt/config/promtail-config.yaml
  rsyslog:
    image: rsyslog/syslog_appliance_alpine
    restart: unless-stopped
    ports:
      - 514:514/udp
      - 514:514/tcp
    environment:
      - RSYSLOG_CONF=/config/rsyslog.conf
    volumes:
      - ./rsyslog.conf:/config/rsyslog.conf

promtail configuration:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log
  pipeline_stages:
    - regex:
        expression: "^[\\w]+\\s+[\\d]+\\s+[\\d|:]+ (?P<host>[^\\s]+) (?P<tag>[^:\\[]+)"
    - labels:
        host:
        tag:
- job_name: syslog
  syslog:
    listen_address: "0.0.0.0:514"
    idle_timeout: 10m
    label_structured_data: yes
    labels:
      job: "syslog"
  relabel_configs:
    - source_labels: ["__syslog_connection_ip_address"]
      target_label: "ip_address"
    - source_labels: ["__syslog_message_severity"]
      target_label: "severity"
    - source_labels: ["__syslog_message_facility"]
      target_label: "facility"
    - source_labels: ["__syslog_message_app_name"]
      target_label: "appname"
    - source_labels: ["__syslog_message_hostname"]
      target_label: "host"

rsyslog.conf configured according to: https://grafana.com/docs/loki/latest/clients/promtail/scraping/#rsyslog-output-configuration

Solution

The way I solved this problem was by configuring rsyslog to use a modified template that sets the app-name to - (nil) when the app-name field is blank. The resulting configuration is shown below. Note that this solution should work for any field (e.g. hostname) sent from misbehaving syslog clients.

...

:app-name, !isequal, "" {
    action(type="omfwd" protocol="tcp" target= "promtail" port="514" Template="RSYSLOG_SyslogProtocol23Format" TCP_Framing="octet-counted" KeepAlive="on")
}

# RSYSLOG_SyslogProtocol23Format but with app-name hard-coded to '-'
template(name="missingAppName" type="string" string="<%PRI%>1 %TIMESTAMP:::date-rfc3339% %HOSTNAME% - %PROCID% %MSGID% %STRUCTURED-DATA% %msg%\n")

:app-name, isequal, "" {
    action(type="omfwd" protocol="tcp" target= "promtail" port="514" Template="missingAppName" TCP_Framing="octet-counted" KeepAlive="on")
}

...
