Files
qortal/.claude/JFR_changes_3.md

13 KiB
Raw Permalink Blame History

JFR / GC Analysis (Run 2) — Changes Applied

Reference analysis: /home/nicola/Downloads/_QORTA/analysis.md (see also .claude/JFR_analysis_3.md)

This is the third round of fixes. The reference analysis confirmed the OOM trajectory from Run 1 is fixed (heap stable 13+ hours) but surfaced new contention, CPU, exception-rate and thread-churn findings. Priorities below use the analysis.md numbering (P0P8).


P0 — G1HeapOccupancyPercent=35 (NOT changed here)

Already present in the node startup script — intentionally skipped in code.


P1 — ArbitraryDataBuilderThread.java: DB status check held a monitor lock

File: src/main/java/org/qortal/controller/arbitrary/ArbitraryDataBuilderThread.java Commit: dfb7f551

Problem: All 5 builder threads contend on synchronized (buildManager.arbitraryDataBuildQueue). Inside that block the old code called ArbitraryTransactionUtils.getStatus(...), which opens a DB connection, runs a query, writes a status row and commits — potentially 10+ seconds under load. While one thread held the monitor doing that DB work, the other 4 builder threads were all blocked. This is what the JFR caught as ConcurrentHashMap / JavaMonitorEnter contention with waits up to ~11.6s (the analysis attributed it to "computeIfAbsent slow compute", but the real mechanism was the DB call under the queue monitor).

Fix: prepareForBuild() (which sets buildStartTimestamp, marking the item in-progress so other threads' isQueued() filter skips it) now runs inside the lock to reserve the item. The DB getStatus() call was moved outside the synchronized block. The 5 builder threads can now overlap their DB calls instead of serializing through the queue monitor.


P4 — pom.xml: Jackson library stack split across incompatible versions

File: pom.xml Commit: e36bdd3f

Problem: The Jackson ecosystem was split across two versions on the runtime classpath:

Version Artifacts Source
2.18.2 jackson-jaxrs-json-provider, jackson-jaxrs-base, jackson-module-jaxb-annotations, jackson-dataformat-yaml, jackson-datatype-jsr310 transitive via swagger-core:2.2.30
2.15.2 jackson-databind, jackson-core, jackson-annotations explicitly pinned in pom.xml

The jackson-jaxrs-json-provider:2.18.2 is compiled against jackson-databind:2.18.2 internals but ran against 2.15.2, throwing NoSuchMethodError / IncompatibleClassChangeError on the DirectMethodHandle$Holder path for every JAX-RS request that serialized a JSON response — the ~124 exceptions/second (≈5.8M total/run) seen on the Jetty worker threads.

Fix: Imported the official Jackson BOM in a new <dependencyManagement> section:

<dependency>
  <groupId>com.fasterxml.jackson</groupId>
  <artifactId>jackson-bom</artifactId>
  <version>${jackson.version}</version>   <!-- 2.18.2 -->
  <type>pom</type>
  <scope>import</scope>
</dependency>

Added a jackson.version=2.18.2 property and removed the now-redundant hardcoded 2.15.2 from the direct jackson-databind dependency. mvn dependency:tree confirms all 8 Jackson artifacts now resolve to 2.18.2; the project compiles cleanly against the bumped databind. The 2.15.2 pin came from the "JSON import/export support" commit and was an arbitrary version choice (the "latest stable" comment was wrong — swagger already required 2.18.2), so aligning upward carries no known regression and picks up later security fixes.


P5 — HSQLDBDatabaseUpdates.java: cache evicting on row-count, not bytes

File: src/main/java/org/qortal/repository/hsqldb/HSQLDBDatabaseUpdates.java Commit: 7170f539

Problem: BaseHashMap.clearToHalf() was the #1 CPU hotspot at 10.6%, driven by HSQLDB Cache.cleanUp(). Verified against the HSQLDB 2.7.4 source: the row cache has two independent limits and evicts when either is hit (Cache.preparePut()exceedsCount || exceedsSize):

  • hsqldb.cache_size (SET FILES CACHE SIZE) — default 10,000 KB
  • hsqldb.cache_rows (SET FILES CACHE ROWS) — default 50,000 rows

A prior commit (7a9a80a7) ran SET FILES CACHE SIZE 200000, raising only the byte budget to ~200 MB — CACHE SIZE is in kilobytes, not rows (the commit comment conflated the two). The row-count limit stayed at its default 50,000. Because the JFR was captured on this branch after that commit and still showed clearToHalf thrashing, the working set is row-count-bound: it hits 50,000 rows long before 200 MB, so the byte bump never took effect.

Fix: Added migration case 53: running SET FILES CACHE ROWS 200000, so the cache can grow into the 200 MB byte budget already configured in case 52. Verified in the Cache constructor that the backing arrays are pre-allocated to the CACHE ROWS value, so 200,000 (matching the prior commit's stated intent and the analysis's 100k500k recommendation) was chosen rather than an arbitrarily huge value — ~58 MB of index arrays, with memory still hard-bounded by the unchanged 200 MB byte cap (heap-safe). The misleading comment on case 52 was also corrected. A new case 53: (rather than editing case 52) ensures already-upgraded nodes pick up the change on the next version bump.


P3 — Network.java: Network-Worker thread churn (~53 created/min)

File: src/main/java/org/qortal/network/Network.java Status: applied, not yet committed

Problem: Network-Worker threads were created and destroyed at ~53/min, contributing to ~4% CPU in ThreadLocalMap.set/remove (stale-entry cleanup runs on every set/remove when threads churn) plus safepoint overhead. The churn is intrinsic to the ExecuteProduceConsume (EPC) algorithm: it spawns a logical worker whenever every active thread becomes a consumer, and exits workers when there's an idle-producer excess. The backing ThreadPoolExecutor was core=2, max=512, keepalive=5s, SynchronousQueue, so almost every EPC spawn created a brand-new physical Thread that then died within 5s — and physical thread create/destroy is what churns ThreadLocalMap.

Fix: Keep physical threads warm for reuse:

  • core 2 → 10 (NETWORK_EPC_CORE_THREADS) — a pinned warm baseline (core threads never time out) absorbs steady-state load with zero churn.
  • keepalive 5s → 60s (NETWORK_EPC_KEEPALIVE) — burst-overflow threads (above core) linger long enough to be reused instead of respawned.

Deliberately NOT changed: max=512 and the SynchronousQueue. The analysis suggested capping max at ~8 with a queue, but that is dangerous for this EPC: it breaks the required immediate producer→consumer handoff and would cause RejectedExecutionException storms / network stalls under burst. The sibling NetworkData-Worker pool was verified to already be churn-free (core=10 with an unbounded LinkedBlockingQueue never grows past 10 threads) and was left alone.


P2 — TradeOffersWebSocket.java: DB queries under a java.util.HashMap monitor

File: src/main/java/org/qortal/api/websocket/TradeOffersWebSocket.java Status: applied, not yet committed

Problem: The analysis flagged JavaMonitorEnter contention on a raw java.util.HashMap (20 events, 5,570ms, ~278ms average wait). The JavaMonitorEnter class is the key clue: a data race without synchronization produces no monitor events, so this had to be an explicit synchronized (someHashMap). After ruling out the wrapper-typed (Collections$SynchronizedMap) and non-map lock objects, and the plain-HashMap monitors that are only ever touched single-threaded (AccountRefCache, used solely from single-threaded Block.process()/orphan()) or hold no I/O (ChatNotifier.listenersBySession, TradeBotWebSocket.PREVIOUS_STATES), the culprit was TradeOffersWebSocket.cachedInfoByBlockchain (new HashMap<>()).

Its listen() method (a NewChainTipEvent handler, fired every block) held synchronized (cachedInfoByBlockchain) across a loop of repository.getATRepository().getMatchingFinalATStates(...) + produceSummaries(...) DB queries — once per supported blockchain, per ACCT. Meanwhile onWebSocketConnect() also locks cachedInfoByBlockchain to read initial data, so every client connecting during a block's AT-state scan blocked for the full scan duration. (Not a corruption risk — it is synchronized — but the monitor was held far too long.)

Fix: Moved all repository/DB work in listen() outside the synchronized (cachedInfoByBlockchain) block. The DB queries build the crossChainOfferSummaries list with no lock held; the monitor is then taken only for the fast in-memory cache mutation (diff against previousAtModes, update currentSummaries/historicSummaries, prune >24h entries). The lock's real job — guarding the cache structure between the listen() writer and the onWebSocketConnect() reader — is preserved; the slow DB scan no longer blocks connecting clients. Behavior (including the "skip if unchanged" continue and historic pruning) is unchanged.


P7 — NTP.java: unbounded, unnamed NTP poll pool (~25 threads/min)

File: src/main/java/org/qortal/utils/NTP.java Status: applied, not yet committed

Problem: The anonymous pool-5-thread churner doing short-lived work was NTP's serverExecutor = Executors.newCachedThreadPool(). It submits one short task per NTP server each poll cycle; because per-server poll intervals can exceed the cached pool's 60s keepalive, its threads kept dying between polls and being recreated. Anonymous Executors pools also surface as the unhelpful pool-N-thread-M name in profiles.

Fix: Replaced with a fixed pool sized to the server count, using a named daemon factory:

serverExecutor = Executors.newFixedThreadPool(
        Math.max(1, ntpServers.size()),
        new DaemonThreadFactory("NTP-poll", Thread.NORM_PRIORITY));

Reuses warm threads each cycle (no churn), bounds concurrency to the number of servers, and names threads for future profiling. (pool-5 couldn't be confirmed as NTP without a runtime thread dump, but it's the strongest-matching candidate and bounding/naming this unbounded anonymous pool is correct regardless — the next profile will name it explicitly.)


P6 — Redundant X25519 shared-secret computation in the handshake

Files: src/main/java/org/qortal/crypto/Crypto.java, src/main/java/org/qortal/network/Network.java, src/main/java/org/qortal/network/NetworkData.java, src/main/java/org/qortal/network/Peer.java, src/main/java/org/qortal/network/Handshake.java Status: applied, not yet committed

Context — this is the lowest-impact JFR item. X25519Field.create() (new int[10]) showed 14,124 allocation events but ranked only 14th by allocation weight: ≈0.8 MB over the 61-min window of tiny, short-lived young-gen arrays that never reach old gen. It is not a GC driver. The real cost is the CPU of the X25519 scalar multiplication, and two redundancies in the handshake path. The analysis's suggested options were rejected: switching to JVM-native KeyAgreement("XDH") would risk the handshake's byte-for-byte compatibility (Qortal uses a custom Ed25519→X25519 birational map + SHA-512 clamping), and pooling int[10] arrays would require forking BouncyCastle — both poor trade-offs for ~0.8 MB/h. Only the two zero-wire-change wins (A + B) were applied. (P3's reduction in handshake churn already lowers this proportionally.)

Fix A — derive our X25519 private params once. The node's Ed25519 key is fixed for the JVM lifetime, yet getSharedSecret re-ran toX25519PrivateKey (a SHA-512 digest) and new X25519PrivateKeyParameters on every call. Added Crypto.toX25519PrivateKeyParams(...) plus a getSharedSecret(X25519PrivateKeyParameters, byte[]) overload; Network and NetworkData now cache ourX25519PrivateKeyParams once at construction and pass it in. The existing getSharedSecret(byte[], byte[]) is retained (delegates to the overload) for other callers (e.g. PrivateKeyAccount).

Fix B — compute the per-handshake secret once. The shared secret was computed twice per handshake — in RESPONSE.action() (sending our RESPONSE) and in the RESPONSE message handler (validating the peer's). Both use the same (our fixed key, peer's public key) → identical result. Added a volatile byte[] handshakeSharedSecret field + accessors on Peer, and a Handshake.getOrComputeSharedSecret(peer, peersPublicKey) helper that computes once, caches on the peer, and reuses. The concurrent compute is benign (deterministic result; worst case matches the old double-compute). No wire bytes change — the secret is identical, just computed once and reused.


Remaining (not yet applied)

ID Description Notes
P8 Restore heap headroom (-Xmx) GC tuning — startup-script territory like P0; may already be set.