13 KiB
JFR / GC Analysis (Run 2) — Changes Applied
Reference analysis: /home/nicola/Downloads/_QORTA/analysis.md (see also .claude/JFR_analysis_3.md)
This is the third round of fixes. The reference analysis confirmed the OOM trajectory from Run 1 is fixed (heap stable 13+ hours) but surfaced new contention, CPU, exception-rate and thread-churn findings. Priorities below use the analysis.md numbering (P0–P8).
P0 — G1HeapOccupancyPercent=35 (NOT changed here)
Already present in the node startup script — intentionally skipped in code.
P1 — ArbitraryDataBuilderThread.java: DB status check held a monitor lock
File: src/main/java/org/qortal/controller/arbitrary/ArbitraryDataBuilderThread.java
Commit: dfb7f551
Problem: All 5 builder threads contend on
synchronized (buildManager.arbitraryDataBuildQueue). Inside that block the old
code called ArbitraryTransactionUtils.getStatus(...), which opens a DB
connection, runs a query, writes a status row and commits — potentially 10+
seconds under load. While one thread held the monitor doing that DB work, the
other 4 builder threads were all blocked. This is what the JFR caught as
ConcurrentHashMap / JavaMonitorEnter contention with waits up to ~11.6s
(the analysis attributed it to "computeIfAbsent slow compute", but the real
mechanism was the DB call under the queue monitor).
Fix: prepareForBuild() (which sets buildStartTimestamp, marking the item
in-progress so other threads' isQueued() filter skips it) now runs inside
the lock to reserve the item. The DB getStatus() call was moved outside the
synchronized block. The 5 builder threads can now overlap their DB calls instead
of serializing through the queue monitor.
P4 — pom.xml: Jackson library stack split across incompatible versions
File: pom.xml
Commit: e36bdd3f
Problem: The Jackson ecosystem was split across two versions on the runtime classpath:
| Version | Artifacts | Source |
|---|---|---|
| 2.18.2 | jackson-jaxrs-json-provider, jackson-jaxrs-base, jackson-module-jaxb-annotations, jackson-dataformat-yaml, jackson-datatype-jsr310 | transitive via swagger-core:2.2.30 |
| 2.15.2 | jackson-databind, jackson-core, jackson-annotations | explicitly pinned in pom.xml |
The jackson-jaxrs-json-provider:2.18.2 is compiled against
jackson-databind:2.18.2 internals but ran against 2.15.2, throwing
NoSuchMethodError / IncompatibleClassChangeError on the DirectMethodHandle$Holder
path for every JAX-RS request that serialized a JSON response — the ~124
exceptions/second (≈5.8M total/run) seen on the Jetty worker threads.
Fix: Imported the official Jackson BOM in a new <dependencyManagement>
section:
<dependency>
<groupId>com.fasterxml.jackson</groupId>
<artifactId>jackson-bom</artifactId>
<version>${jackson.version}</version> <!-- 2.18.2 -->
<type>pom</type>
<scope>import</scope>
</dependency>
Added a jackson.version=2.18.2 property and removed the now-redundant
hardcoded 2.15.2 from the direct jackson-databind dependency. mvn dependency:tree confirms all 8 Jackson artifacts now resolve to 2.18.2; the
project compiles cleanly against the bumped databind. The 2.15.2 pin came from
the "JSON import/export support" commit and was an arbitrary version choice (the
"latest stable" comment was wrong — swagger already required 2.18.2), so
aligning upward carries no known regression and picks up later security fixes.
P5 — HSQLDBDatabaseUpdates.java: cache evicting on row-count, not bytes
File: src/main/java/org/qortal/repository/hsqldb/HSQLDBDatabaseUpdates.java
Commit: 7170f539
Problem: BaseHashMap.clearToHalf() was the #1 CPU hotspot at 10.6%, driven
by HSQLDB Cache.cleanUp(). Verified against the HSQLDB 2.7.4 source: the row
cache has two independent limits and evicts when either is hit
(Cache.preparePut() → exceedsCount || exceedsSize):
hsqldb.cache_size(SET FILES CACHE SIZE) — default 10,000 KBhsqldb.cache_rows(SET FILES CACHE ROWS) — default 50,000 rows
A prior commit (7a9a80a7) ran SET FILES CACHE SIZE 200000, raising only the
byte budget to ~200 MB — CACHE SIZE is in kilobytes, not rows (the commit
comment conflated the two). The row-count limit stayed at its default 50,000.
Because the JFR was captured on this branch after that commit and still showed
clearToHalf thrashing, the working set is row-count-bound: it hits 50,000 rows
long before 200 MB, so the byte bump never took effect.
Fix: Added migration case 53: running SET FILES CACHE ROWS 200000, so the
cache can grow into the 200 MB byte budget already configured in case 52.
Verified in the Cache constructor that the backing arrays are pre-allocated to
the CACHE ROWS value, so 200,000 (matching the prior commit's stated intent and
the analysis's 100k–500k recommendation) was chosen rather than an arbitrarily
huge value — ~5–8 MB of index arrays, with memory still hard-bounded by the
unchanged 200 MB byte cap (heap-safe). The misleading comment on case 52 was also
corrected. A new case 53: (rather than editing case 52) ensures already-upgraded
nodes pick up the change on the next version bump.
P3 — Network.java: Network-Worker thread churn (~53 created/min)
File: src/main/java/org/qortal/network/Network.java
Status: applied, not yet committed
Problem: Network-Worker threads were created and destroyed at ~53/min,
contributing to ~4% CPU in ThreadLocalMap.set/remove (stale-entry cleanup
runs on every set/remove when threads churn) plus safepoint overhead. The churn
is intrinsic to the ExecuteProduceConsume (EPC) algorithm: it spawns a logical
worker whenever every active thread becomes a consumer, and exits workers when
there's an idle-producer excess. The backing ThreadPoolExecutor was
core=2, max=512, keepalive=5s, SynchronousQueue, so almost every EPC spawn
created a brand-new physical Thread that then died within 5s — and physical
thread create/destroy is what churns ThreadLocalMap.
Fix: Keep physical threads warm for reuse:
core2 → 10 (NETWORK_EPC_CORE_THREADS) — a pinned warm baseline (core threads never time out) absorbs steady-state load with zero churn.keepalive5s → 60s (NETWORK_EPC_KEEPALIVE) — burst-overflow threads (above core) linger long enough to be reused instead of respawned.
Deliberately NOT changed: max=512 and the SynchronousQueue. The analysis
suggested capping max at ~8 with a queue, but that is dangerous for this EPC: it
breaks the required immediate producer→consumer handoff and would cause
RejectedExecutionException storms / network stalls under burst. The sibling
NetworkData-Worker pool was verified to already be churn-free (core=10 with an
unbounded LinkedBlockingQueue never grows past 10 threads) and was left alone.
P2 — TradeOffersWebSocket.java: DB queries under a java.util.HashMap monitor
File: src/main/java/org/qortal/api/websocket/TradeOffersWebSocket.java
Status: applied, not yet committed
Problem: The analysis flagged JavaMonitorEnter contention on a raw
java.util.HashMap (20 events, 5,570ms, ~278ms average wait). The JavaMonitorEnter
class is the key clue: a data race without synchronization produces no monitor
events, so this had to be an explicit synchronized (someHashMap). After ruling out
the wrapper-typed (Collections$SynchronizedMap) and non-map lock objects, and the
plain-HashMap monitors that are only ever touched single-threaded (AccountRefCache,
used solely from single-threaded Block.process()/orphan()) or hold no I/O
(ChatNotifier.listenersBySession, TradeBotWebSocket.PREVIOUS_STATES), the culprit
was TradeOffersWebSocket.cachedInfoByBlockchain (new HashMap<>()).
Its listen() method (a NewChainTipEvent handler, fired every block) held
synchronized (cachedInfoByBlockchain) across a loop of
repository.getATRepository().getMatchingFinalATStates(...) + produceSummaries(...)
DB queries — once per supported blockchain, per ACCT. Meanwhile
onWebSocketConnect() also locks cachedInfoByBlockchain to read initial data, so
every client connecting during a block's AT-state scan blocked for the full scan
duration. (Not a corruption risk — it is synchronized — but the monitor was held
far too long.)
Fix: Moved all repository/DB work in listen() outside the
synchronized (cachedInfoByBlockchain) block. The DB queries build the
crossChainOfferSummaries list with no lock held; the monitor is then taken only for
the fast in-memory cache mutation (diff against previousAtModes, update
currentSummaries/historicSummaries, prune >24h entries). The lock's real job —
guarding the cache structure between the listen() writer and the
onWebSocketConnect() reader — is preserved; the slow DB scan no longer blocks
connecting clients. Behavior (including the "skip if unchanged" continue and
historic pruning) is unchanged.
P7 — NTP.java: unbounded, unnamed NTP poll pool (~25 threads/min)
File: src/main/java/org/qortal/utils/NTP.java
Status: applied, not yet committed
Problem: The anonymous pool-5-thread churner doing short-lived work was
NTP's serverExecutor = Executors.newCachedThreadPool(). It submits one short
task per NTP server each poll cycle; because per-server poll intervals can exceed
the cached pool's 60s keepalive, its threads kept dying between polls and being
recreated. Anonymous Executors pools also surface as the unhelpful
pool-N-thread-M name in profiles.
Fix: Replaced with a fixed pool sized to the server count, using a named daemon factory:
serverExecutor = Executors.newFixedThreadPool(
Math.max(1, ntpServers.size()),
new DaemonThreadFactory("NTP-poll", Thread.NORM_PRIORITY));
Reuses warm threads each cycle (no churn), bounds concurrency to the number of
servers, and names threads for future profiling. (pool-5 couldn't be confirmed
as NTP without a runtime thread dump, but it's the strongest-matching candidate
and bounding/naming this unbounded anonymous pool is correct regardless — the
next profile will name it explicitly.)
P6 — Redundant X25519 shared-secret computation in the handshake
Files: src/main/java/org/qortal/crypto/Crypto.java,
src/main/java/org/qortal/network/Network.java,
src/main/java/org/qortal/network/NetworkData.java,
src/main/java/org/qortal/network/Peer.java,
src/main/java/org/qortal/network/Handshake.java
Status: applied, not yet committed
Context — this is the lowest-impact JFR item. X25519Field.create() (new int[10])
showed 14,124 allocation events but ranked only 14th by allocation weight: ≈0.8 MB
over the 61-min window of tiny, short-lived young-gen arrays that never reach old gen. It
is not a GC driver. The real cost is the CPU of the X25519 scalar multiplication, and
two redundancies in the handshake path. The analysis's suggested options were rejected:
switching to JVM-native KeyAgreement("XDH") would risk the handshake's byte-for-byte
compatibility (Qortal uses a custom Ed25519→X25519 birational map + SHA-512 clamping), and
pooling int[10] arrays would require forking BouncyCastle — both poor trade-offs for
~0.8 MB/h. Only the two zero-wire-change wins (A + B) were applied. (P3's reduction in
handshake churn already lowers this proportionally.)
Fix A — derive our X25519 private params once. The node's Ed25519 key is fixed for
the JVM lifetime, yet getSharedSecret re-ran toX25519PrivateKey (a SHA-512 digest) and
new X25519PrivateKeyParameters on every call. Added Crypto.toX25519PrivateKeyParams(...)
plus a getSharedSecret(X25519PrivateKeyParameters, byte[]) overload; Network and
NetworkData now cache ourX25519PrivateKeyParams once at construction and pass it in. The
existing getSharedSecret(byte[], byte[]) is retained (delegates to the overload) for other
callers (e.g. PrivateKeyAccount).
Fix B — compute the per-handshake secret once. The shared secret was computed twice per
handshake — in RESPONSE.action() (sending our RESPONSE) and in the RESPONSE message handler
(validating the peer's). Both use the same (our fixed key, peer's public key) → identical
result. Added a volatile byte[] handshakeSharedSecret field + accessors on Peer, and a
Handshake.getOrComputeSharedSecret(peer, peersPublicKey) helper that computes once, caches
on the peer, and reuses. The concurrent compute is benign (deterministic result; worst case
matches the old double-compute). No wire bytes change — the secret is identical, just computed
once and reused.
Remaining (not yet applied)
| ID | Description | Notes |
|---|---|---|
| P8 | Restore heap headroom (-Xmx) |
GC tuning — startup-script territory like P0; may already be set. |