Files
qortal/.claude/JFR_analysis.md
2026-05-31 14:59:14 +02:00

12 KiB
Raw Permalink Blame History

JFR Analysis — snap-140558.jfr

Host: masked info JVM: OpenJDK 17.0.19, G1GC, max heap 3.64 GB Recording: 2026-05-31 11:59:35 UTC, duration 384 seconds, 4 chunks, 22 MB


1. CPU Load During Recording

Metric Observed range
Machine total CPU 55 96%
JVM user CPU 20 50%
JVM system CPU 5 15%
System load avg (at snapshot) 6.11 on 4 cores (152% overloaded)

The machine is CPU-saturated throughout the recording. JVM user CPU is consistently high and system CPU (kernel/IO) adds another significant fraction.


2. GC — The Primary Problem

2.1 Overview

Metric Value
Total GC events 485 in 384 s (~1.26 GC/sec)
Young GC events 248
Old (concurrent) GC events 237
Total stop-the-world pause time 30,514 ms (8% of runtime)
Average pause 62.9 ms
Maximum pause 5,930 ms (5.93 seconds)

8% of application time is spent paused for GC. The 5.93 s max pause is catastrophic for a peer-to-peer node — it drops connections and misses block propagation windows.

2.2 GC Cause Breakdown

Cause Count %
G1 Humongous Allocation 474 97.7%
G1 Preventive Collection 9 1.9%
G1 Evacuation Pause 2 0.4%

97.7% of all GC events are triggered by humongous object allocations. This is the smoking gun.

2.3 What Is a Humongous Allocation?

G1GC divides the heap into equal-sized regions. Any single allocation ≥ 50% of the region size is a "humongous object" and is handled specially:

  • It is allocated directly in Old Gen (bypasses Eden entirely).
  • G1 triggers a Young GC immediately to reclaim Old Gen space when humongous objects accumulate.
  • Humongous objects fragment the heap and are expensive to free.

For a 3.64 GB max heap, G1 uses 2 MB regions → humongous threshold = 1 MB.


3. Root Cause: Peer.readChannel() Allocates 2 MB Buffers

3.1 The allocation

src/main/java/org/qortal/network/Peer.java, line 783:

if (this.byteBuffer == null) {
    this.byteBuffer = ByteBuffer.allocate(Network.getInstance().getMaxMessageSize());
}

getMaxMessageSize() is computed once in Network as:

maxMessageSize = 4 + 1 + 4 + BlockChain.getInstance().getMaxBlockSize();

maxBlockSize is set to 2,097,152 bytes in blockchain.json:

"maxBlockSize": 2097152

So every ByteBuffer.allocate call here allocates 2,097,161 bytes ≈ 2.000009 MB — just above the 2 MB region boundary, and double the 1 MB humongous threshold.

3.2 Why this triggers constant GCs

  • The buffer starts as null (line 720) and is freed (set to null) each time the buffer is empty after a complete message is consumed (line 847).
  • Every peer connection that reads even a single byte triggers a fresh 2 MB humongous allocation.
  • With 242+ live threads and active peer networking, this allocation fires continuously.
  • The JFR ObjectAllocationSample records 15,985 samples of byte[] originating from Peer.readChannel() line 783, with a sampled weight of 2.0 MB each — the single largest allocation source in the entire recording (40.7% of all allocation samples are byte[]).

3.3 Impact chain

Peer connects / sends data
  → byteBuffer == null → allocate 2,097,161 bytes
    → humongous object → lands in Old Gen
      → Old Gen fills → G1 Young GC fires (cause: "G1 Humongous Allocation")
        → STW pause 63 ms avg, up to 5,930 ms max
          → application stalls, connections time out

4. CPU Hotspots

4.1 Java execution samples (8,366 total)

Thread Samples %
Arbitrary Data group (all threads) ~4,172 49.9%
Synchronizer 995 11.9%
pool-26-thread-1 562 6.7%
AT States trimmer / pool-* 486 5.8%
BlockMinter 412 4.9%
pool-16-thread-1 382 4.6%
Foreign Fee Manager-1 277 3.3%

4.2 Native method samples (17,519 total)

Thread Samples % Note
RMI TCP Connection threads 2,956 16.9% JMX/diagnostic traffic
HttpClient-1-SelectorManager 1,478 8.4%
qtp406511188-272/273/274 ~4,431 25.3% Jetty HTTP threads (SSL)
grpc-nio-worker-ELG-1-2/3/4 ~4,431 25.3% gRPC NIO workers
Network-IO 1,320 7.5%
NetworkData-IO 1,247 7.1%
Arbitrary Data threads 570 3.3%

The Jetty + gRPC NIO workers dominate native samples, indicating heavy network I/O in kernel space (epoll/select syscalls).

4.3 Arbitrary Data subsystem — what it is actually doing

The execution samples for all Arbitrary Data threads point to the same HSQLDB call chain:

QuerySpecification.buildResult()
  → RangeVariable$RangeIteratorMain.next()
    → IndexAVL$IndexRowIterator.getNextRow()
      → IndexAVL.next()
        → NodeAVLDisk.getLeft/getRight/getParent()
          → RowStoreAVLDisk.get()
            → DataFileCache.getFromFile()   ← disk read under WriteLock
              → DataFileCache.get()

This is a full AVL index scan on a disk-backed HSQLDB table, repeated continuously by multiple threads (Arbitrary Data Manager, Arbitrary Index Cache Timer Task, Arbitrary Data Cleanup Manager, Arbitrary Data Cache Manager). The ReentrantReadWriteLock$WriteLock.lock() on DataFileCache causes cross-thread contention on every disk read.

Additionally, Cache.cleanUp() / Cache.preparePut() appear in samples, meaning the HSQLDB row cache is under constant eviction pressure — it cannot hold the working set in memory.

4.4 Foreign Fee Manager — EdDSA in a hot path

Foreign Fee Manager-1 (277 Java execution samples + 12 native samples) is burning CPU on:

Crypto.verify()
  → Ed25519.implVerify()
    → Ed25519.scalarMultStraus128Var()
      → Wnaf.getSignedVar()

Ed25519 signature verification is expensive on aarch64 without hardware acceleration. If processForeignFeesImportQueue() is verifying signatures in a tight scheduler loop, it should batch or throttle.


5. Object Allocation Profile

5.1 Allocation sample type breakdown (85,503 total samples)

Type Samples % Primary source
byte[] 34,837 40.7% Peer.readChannel() (2 MB buffers) + HSQLDB binary reads
NodeAVLDisk 6,278 7.3% HSQLDB index node reads
Peer$PreSerializedMessageWrapper 5,013 5.9% Outbound peer message serialization
AQS$ConditionNode 4,505 5.3% Lock condition queue churn
Object[] 3,974 4.6% General
int[] 3,554 4.2% HSQLDB / general
RowAVLDisk 1,890 2.2% HSQLDB row reads
BinaryData 1,687 2.0% HSQLDB binary column data
Long 1,510 1.8% Autoboxing

5.2 Promotion to Old Gen (PromoteObjectInNewPLAB)

  • 66,987 total promote events; 60,064 (89.7%) tenured = true.
  • Nearly 90% of promoted objects survive to Old Gen, confirming the working set is too large for Young Gen to absorb.
  • plabSize = 1.0 MB appears 4,989 times — large 1 MB PLAB chunks being allocated in Old Gen for byte[], BinaryData, and NodeAVLDisk promotions.

5.3 Old Object Sample (741 events — long-lived objects)

All 741 OldObjectSample events record object = [ (arrays), consistent with the 2 MB byte[] buffers accumulating in Old Gen between concurrent GC cycles.


6. Memory State

Pool Used Committed Max
G1 Eden Space 64 MB 360 MB dynamic
G1 Old Gen 1.19 GB 1.54 GB 3.64 GB
G1 Survivor Space 44 MB 44 MB dynamic
Metaspace 76 MB unlimited
CodeHeap (all) 28.8 MB
  • Old Gen grew from 897 MB → 1.19 GB (+300 MB) in ~3 minutes of the live observation window.
  • G1 Eden has 360 MB committed but only 64 MB used — G1 is not utilising Eden efficiently because most large allocations skip it entirely (humongous path).
  • Host free physical memory: 74.7 MB of 7.28 GB. JVM committed virtual memory (7.52 GB) exceeds total RAM — the OS is paging.

7. Thread State Snapshot

State Count
RUNNABLE 20
WAITING 124
TIMED_WAITING 98
Live total 242 (peak 270, 860 started)
  • No deadlocks.
  • No blocked threads.
  • The high WAITING + TIMED_WAITING count with non-trivial CPU burn (seen in the JMX sample) is explained by threads that loop through work, block briefly on a lock or park, and immediately resume — appearing "waiting" in a point-in-time snapshot but consuming real CPU over time.

8. Summary of Findings

# Finding Severity
1 Peer.readChannel() allocates a 2 MB ByteBuffer per message cycle; humongous threshold is 1 MB → 97.7% of GCs triggered by this Critical
2 Max GC pause of 5.93 seconds; 8% of runtime in STW pauses Critical
3 Old Gen growing ~100 MB/min; Full GC imminent if growth continues High
4 Physical RAM exhausted (74 MB free); JVM committed memory exceeds RAM → OS paging High
5 Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB High
6 HSQLDB DataFileCache under constant WriteLock contention across 4+ threads Medium
7 Foreign Fee Manager verifies Ed25519 signatures in a tight scheduler loop Medium
8 89.7% of Young Gen promotions go directly to Old Gen — Young Gen is not filtering effectively Medium

Fix 1 — Eliminate humongous ByteBuffer allocations (Critical)

Option A (JVM flag — immediate mitigation, no code change): Add -XX:G1HeapRegionSize=4m to the JVM launch flags. This doubles the region size, moving the humongous threshold from 1 MB to 2 MB. The 2,097,161-byte buffers drop below the threshold and allocate normally in Eden.

Option B (code fix — correct solution): Do not set byteBuffer = null after each message. Keep the buffer allocated for the lifetime of the connection and clear() / compact() it instead. This means one 2 MB allocation per peer connection (not per message), and since it stays in Old Gen only once, it stops triggering repeated GC cycles.

Option C (deeper fix): Allocate a small initial buffer and grow it only when a large message is detected from the header length. Most messages are small; only block messages need 2 MB.

Fix 2 — Reduce HSQLDB full-scan frequency (High)

The Arbitrary Data threads run full IndexAVL scans on every tick. Determine whether the queries driving Arbitrary Data Manager, Arbitrary Index Cache Timer Task, and Arbitrary Data Cleanup Manager have appropriate indexes, and whether their scheduling interval can be increased.

Fix 3 — Throttle Ed25519 verification in Foreign Fee Manager (Medium)

Batch signature verification or add a rate limit / back-pressure to processForeignFeesImportQueue() so it does not monopolize CPU on each scheduler tick.


10. Implementation Status

# Finding Severity Status Commit
1 Peer.readChannel() 2 MB humongous ByteBuffer per message cycle Critical Fixed b0895f02
2 Max GC pause 5.93 s; 8% of runtime in STW pauses Critical Resolved by #1
3 Old Gen growing ~100 MB/min; Full GC imminent High Resolved by #1
4 Physical RAM exhausted; JVM committed memory exceeds RAM → OS paging High Fixed 4134edb6
5 Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB High Fixed e3de6944
6 HSQLDB DataFileCache under constant WriteLock contention across 4+ threads Medium Resolved by #5
7 Foreign Fee Manager verifies Ed25519 signatures in a tight scheduler loop Medium Fixed 2ef1e3f8
8 89.7% of Young Gen promotions go directly to Old Gen Medium Resolved by #1