qortal/.claude/JFR_changes_3.md

# JFR / GC Analysis (Run 2) — Changes Applied

Reference analysis: `/home/nicola/Downloads/_QORTA/analysis.md` (see also `.claude/JFR_analysis_3.md`)

This is the third round of fixes. The reference analysis confirmed the OOM
trajectory from Run 1 is fixed (heap stable 13+ hours) but surfaced new
contention, CPU, exception-rate and thread-churn findings. Priorities below use
the analysis.md numbering (P0–P8).

---

## P0 — G1HeapOccupancyPercent=35 (NOT changed here)

Already present in the node startup script — intentionally skipped in code.

---

## P1 — ArbitraryDataBuilderThread.java: DB status check held a monitor lock

**File:** `src/main/java/org/qortal/controller/arbitrary/ArbitraryDataBuilderThread.java`
**Commit:** `dfb7f551`

**Problem:** All 5 builder threads contend on
`synchronized (buildManager.arbitraryDataBuildQueue)`. Inside that block the old
code called `ArbitraryTransactionUtils.getStatus(...)`, which opens a DB
connection, runs a query, writes a status row and commits — potentially 10+
seconds under load. While one thread held the monitor doing that DB work, the
other 4 builder threads were all blocked. This is what the JFR caught as
`ConcurrentHashMap` / `JavaMonitorEnter` contention with waits up to ~11.6s
(the analysis attributed it to "computeIfAbsent slow compute", but the real
mechanism was the DB call under the queue monitor).

**Fix:** `prepareForBuild()` (which sets `buildStartTimestamp`, marking the item
in-progress so other threads' `isQueued()` filter skips it) now runs *inside*
the lock to reserve the item. The DB `getStatus()` call was moved *outside* the
synchronized block. The 5 builder threads can now overlap their DB calls instead
of serializing through the queue monitor.

---

## P4 — pom.xml: Jackson library stack split across incompatible versions

**File:** `pom.xml`
**Commit:** `e36bdd3f`

**Problem:** The Jackson ecosystem was split across two versions on the runtime
classpath:

| Version | Artifacts | Source |
|---|---|---|
| **2.18.2** | jackson-jaxrs-json-provider, jackson-jaxrs-base, jackson-module-jaxb-annotations, jackson-dataformat-yaml, jackson-datatype-jsr310 | transitive via `swagger-core:2.2.30` |
| **2.15.2** | jackson-databind, jackson-core, jackson-annotations | explicitly pinned in pom.xml |

The `jackson-jaxrs-json-provider:2.18.2` is compiled against
`jackson-databind:2.18.2` internals but ran against `2.15.2`, throwing
`NoSuchMethodError` / `IncompatibleClassChangeError` on the `DirectMethodHandle$Holder`
path for every JAX-RS request that serialized a JSON response — the ~124
exceptions/second (≈5.8M total/run) seen on the Jetty worker threads.

**Fix:** Imported the official Jackson BOM in a new `<dependencyManagement>`
section:

```xml
<dependency>
  <groupId>com.fasterxml.jackson</groupId>
  <artifactId>jackson-bom</artifactId>
  <version>${jackson.version}</version>   <!-- 2.18.2 -->
  <type>pom</type>
  <scope>import</scope>
</dependency>
```

Added a `jackson.version=2.18.2` property and removed the now-redundant
hardcoded `2.15.2` from the direct `jackson-databind` dependency. `mvn
dependency:tree` confirms all 8 Jackson artifacts now resolve to **2.18.2**; the
project compiles cleanly against the bumped databind. The 2.15.2 pin came from
the "JSON import/export support" commit and was an arbitrary version choice (the
"latest stable" comment was wrong — swagger already required 2.18.2), so
aligning upward carries no known regression and picks up later security fixes.

---

## P5 — HSQLDBDatabaseUpdates.java: cache evicting on row-count, not bytes

**File:** `src/main/java/org/qortal/repository/hsqldb/HSQLDBDatabaseUpdates.java`
**Commit:** `7170f539`

**Problem:** `BaseHashMap.clearToHalf()` was the #1 CPU hotspot at 10.6%, driven
by HSQLDB `Cache.cleanUp()`. Verified against the HSQLDB 2.7.4 source: the row
cache has **two independent limits** and evicts when *either* is hit
(`Cache.preparePut()` → `exceedsCount || exceedsSize`):

- `hsqldb.cache_size` (`SET FILES CACHE SIZE`) — default **10,000 KB**
- `hsqldb.cache_rows` (`SET FILES CACHE ROWS`) — default **50,000 rows**

A prior commit (`7a9a80a7`) ran `SET FILES CACHE SIZE 200000`, raising only the
**byte** budget to ~200 MB — `CACHE SIZE` is in kilobytes, not rows (the commit
comment conflated the two). The **row-count** limit stayed at its default 50,000.
Because the JFR was captured on this branch *after* that commit and still showed
`clearToHalf` thrashing, the working set is row-count-bound: it hits 50,000 rows
long before 200 MB, so the byte bump never took effect.

**Fix:** Added migration `case 53:` running `SET FILES CACHE ROWS 200000`, so the
cache can grow into the 200 MB byte budget already configured in case 52.
Verified in the `Cache` constructor that the backing arrays are pre-allocated to
the CACHE ROWS value, so 200,000 (matching the prior commit's stated intent and
the analysis's 100k–500k recommendation) was chosen rather than an arbitrarily
huge value — ~5–8 MB of index arrays, with memory still hard-bounded by the
unchanged 200 MB byte cap (heap-safe). The misleading comment on case 52 was also
corrected. A new `case 53:` (rather than editing case 52) ensures already-upgraded
nodes pick up the change on the next version bump.

---

## P3 — Network.java: Network-Worker thread churn (~53 created/min)

**File:** `src/main/java/org/qortal/network/Network.java`
**Status:** applied, **not yet committed**

**Problem:** `Network-Worker` threads were created and destroyed at ~53/min,
contributing to ~4% CPU in `ThreadLocalMap.set/remove` (stale-entry cleanup
runs on every set/remove when threads churn) plus safepoint overhead. The churn
is intrinsic to the `ExecuteProduceConsume` (EPC) algorithm: it spawns a logical
worker whenever every active thread becomes a consumer, and exits workers when
there's an idle-producer excess. The backing `ThreadPoolExecutor` was
`core=2, max=512, keepalive=5s, SynchronousQueue`, so almost every EPC spawn
created a brand-new physical `Thread` that then died within 5s — and physical
thread create/destroy is what churns `ThreadLocalMap`.

**Fix:** Keep physical threads warm for reuse:
- `core` **2 → 10** (`NETWORK_EPC_CORE_THREADS`) — a pinned warm baseline (core
  threads never time out) absorbs steady-state load with zero churn.
- `keepalive` **5s → 60s** (`NETWORK_EPC_KEEPALIVE`) — burst-overflow threads
  (above core) linger long enough to be reused instead of respawned.

**Deliberately NOT changed:** `max=512` and the `SynchronousQueue`. The analysis
suggested capping max at ~8 with a queue, but that is dangerous for this EPC: it
breaks the required immediate producer→consumer handoff and would cause
`RejectedExecutionException` storms / network stalls under burst. The sibling
`NetworkData-Worker` pool was verified to already be churn-free (`core=10` with an
unbounded `LinkedBlockingQueue` never grows past 10 threads) and was left alone.

---

## P2 — TradeOffersWebSocket.java: DB queries under a java.util.HashMap monitor

**File:** `src/main/java/org/qortal/api/websocket/TradeOffersWebSocket.java`
**Status:** applied, **not yet committed**

**Problem:** The analysis flagged `JavaMonitorEnter` contention on a raw
`java.util.HashMap` (20 events, 5,570ms, ~278ms average wait). The `JavaMonitorEnter`
class is the key clue: a data race *without* synchronization produces no monitor
events, so this had to be an explicit `synchronized (someHashMap)`. After ruling out
the wrapper-typed (`Collections$SynchronizedMap`) and non-map lock objects, and the
plain-HashMap monitors that are only ever touched single-threaded (`AccountRefCache`,
used solely from single-threaded `Block.process()`/`orphan()`) or hold no I/O
(`ChatNotifier.listenersBySession`, `TradeBotWebSocket.PREVIOUS_STATES`), the culprit
was `TradeOffersWebSocket.cachedInfoByBlockchain` (`new HashMap<>()`).

Its `listen()` method (a `NewChainTipEvent` handler, fired every block) held
`synchronized (cachedInfoByBlockchain)` across a loop of
`repository.getATRepository().getMatchingFinalATStates(...)` + `produceSummaries(...)`
**DB queries** — once per supported blockchain, per ACCT. Meanwhile
`onWebSocketConnect()` also locks `cachedInfoByBlockchain` to read initial data, so
every client connecting during a block's AT-state scan blocked for the full scan
duration. (Not a corruption risk — it *is* synchronized — but the monitor was held
far too long.)

**Fix:** Moved all repository/DB work in `listen()` *outside* the
`synchronized (cachedInfoByBlockchain)` block. The DB queries build the
`crossChainOfferSummaries` list with no lock held; the monitor is then taken only for
the fast in-memory cache mutation (diff against `previousAtModes`, update
`currentSummaries`/`historicSummaries`, prune >24h entries). The lock's real job —
guarding the cache structure between the `listen()` writer and the
`onWebSocketConnect()` reader — is preserved; the slow DB scan no longer blocks
connecting clients. Behavior (including the "skip if unchanged" `continue` and
historic pruning) is unchanged.

---

## P7 — NTP.java: unbounded, unnamed NTP poll pool (~25 threads/min)

**File:** `src/main/java/org/qortal/utils/NTP.java`
**Status:** applied, **not yet committed**

**Problem:** The anonymous `pool-5-thread` churner doing short-lived work was
NTP's `serverExecutor = Executors.newCachedThreadPool()`. It submits one short
task per NTP server each poll cycle; because per-server poll intervals can exceed
the cached pool's 60s keepalive, its threads kept dying between polls and being
recreated. Anonymous Executors pools also surface as the unhelpful
`pool-N-thread-M` name in profiles.

**Fix:** Replaced with a fixed pool sized to the server count, using a named
daemon factory:

```java
serverExecutor = Executors.newFixedThreadPool(
        Math.max(1, ntpServers.size()),
        new DaemonThreadFactory("NTP-poll", Thread.NORM_PRIORITY));
```

Reuses warm threads each cycle (no churn), bounds concurrency to the number of
servers, and names threads for future profiling. (`pool-5` couldn't be confirmed
as NTP without a runtime thread dump, but it's the strongest-matching candidate
and bounding/naming this unbounded anonymous pool is correct regardless — the
next profile will name it explicitly.)

---

## P6 — Redundant X25519 shared-secret computation in the handshake

**Files:** `src/main/java/org/qortal/crypto/Crypto.java`,
`src/main/java/org/qortal/network/Network.java`,
`src/main/java/org/qortal/network/NetworkData.java`,
`src/main/java/org/qortal/network/Peer.java`,
`src/main/java/org/qortal/network/Handshake.java`
**Status:** applied, **not yet committed**

**Context — this is the lowest-impact JFR item.** `X25519Field.create()` (`new int[10]`)
showed 14,124 allocation events but ranked only **14th** by allocation *weight*: ≈0.8 MB
over the 61-min window of tiny, short-lived young-gen arrays that never reach old gen. It
is **not a GC driver.** The real cost is the CPU of the X25519 scalar multiplication, and
two redundancies in the handshake path. The analysis's suggested options were rejected:
switching to JVM-native `KeyAgreement("XDH")` would risk the handshake's byte-for-byte
compatibility (Qortal uses a custom Ed25519→X25519 birational map + SHA-512 clamping), and
pooling `int[10]` arrays would require forking BouncyCastle — both poor trade-offs for
~0.8 MB/h. Only the two zero-wire-change wins (A + B) were applied. (P3's reduction in
handshake churn already lowers this proportionally.)

**Fix A — derive our X25519 private params once.** The node's Ed25519 key is fixed for
the JVM lifetime, yet `getSharedSecret` re-ran `toX25519PrivateKey` (a SHA-512 digest) and
`new X25519PrivateKeyParameters` on every call. Added `Crypto.toX25519PrivateKeyParams(...)`
plus a `getSharedSecret(X25519PrivateKeyParameters, byte[])` overload; `Network` and
`NetworkData` now cache `ourX25519PrivateKeyParams` once at construction and pass it in. The
existing `getSharedSecret(byte[], byte[])` is retained (delegates to the overload) for other
callers (e.g. `PrivateKeyAccount`).

**Fix B — compute the per-handshake secret once.** The shared secret was computed twice per
handshake — in `RESPONSE.action()` (sending our RESPONSE) and in the RESPONSE message handler
(validating the peer's). Both use the same `(our fixed key, peer's public key)` → identical
result. Added a `volatile byte[] handshakeSharedSecret` field + accessors on `Peer`, and a
`Handshake.getOrComputeSharedSecret(peer, peersPublicKey)` helper that computes once, caches
on the peer, and reuses. The concurrent compute is benign (deterministic result; worst case
matches the old double-compute). No wire bytes change — the secret is identical, just computed
once and reused.

---

## Remaining (not yet applied)

| ID | Description | Notes |
|----|-------------|-------|
| P8 | Restore heap headroom (`-Xmx`) | GC tuning — startup-script territory like P0; may already be set. |