12 KiB
JFR Analysis — snap-140558.jfr
Host: masked info JVM: OpenJDK 17.0.19, G1GC, max heap 3.64 GB Recording: 2026-05-31 11:59:35 UTC, duration 384 seconds, 4 chunks, 22 MB
1. CPU Load During Recording
| Metric | Observed range |
|---|---|
| Machine total CPU | 55 – 96% |
| JVM user CPU | 20 – 50% |
| JVM system CPU | 5 – 15% |
| System load avg (at snapshot) | 6.11 on 4 cores (152% overloaded) |
The machine is CPU-saturated throughout the recording. JVM user CPU is consistently high and system CPU (kernel/IO) adds another significant fraction.
2. GC — The Primary Problem
2.1 Overview
| Metric | Value |
|---|---|
| Total GC events | 485 in 384 s (~1.26 GC/sec) |
| Young GC events | 248 |
| Old (concurrent) GC events | 237 |
| Total stop-the-world pause time | 30,514 ms (8% of runtime) |
| Average pause | 62.9 ms |
| Maximum pause | 5,930 ms (5.93 seconds) |
8% of application time is spent paused for GC. The 5.93 s max pause is catastrophic for a peer-to-peer node — it drops connections and misses block propagation windows.
2.2 GC Cause Breakdown
| Cause | Count | % |
|---|---|---|
| G1 Humongous Allocation | 474 | 97.7% |
| G1 Preventive Collection | 9 | 1.9% |
| G1 Evacuation Pause | 2 | 0.4% |
97.7% of all GC events are triggered by humongous object allocations. This is the smoking gun.
2.3 What Is a Humongous Allocation?
G1GC divides the heap into equal-sized regions. Any single allocation ≥ 50% of the region size is a "humongous object" and is handled specially:
- It is allocated directly in Old Gen (bypasses Eden entirely).
- G1 triggers a Young GC immediately to reclaim Old Gen space when humongous objects accumulate.
- Humongous objects fragment the heap and are expensive to free.
For a 3.64 GB max heap, G1 uses 2 MB regions → humongous threshold = 1 MB.
3. Root Cause: Peer.readChannel() Allocates 2 MB Buffers
3.1 The allocation
src/main/java/org/qortal/network/Peer.java, line 783:
if (this.byteBuffer == null) {
this.byteBuffer = ByteBuffer.allocate(Network.getInstance().getMaxMessageSize());
}
getMaxMessageSize() is computed once in Network as:
maxMessageSize = 4 + 1 + 4 + BlockChain.getInstance().getMaxBlockSize();
maxBlockSize is set to 2,097,152 bytes in blockchain.json:
"maxBlockSize": 2097152
So every ByteBuffer.allocate call here allocates 2,097,161 bytes ≈ 2.000009 MB — just above the 2 MB region boundary, and double the 1 MB humongous threshold.
3.2 Why this triggers constant GCs
- The buffer starts as
null(line 720) and is freed (set tonull) each time the buffer is empty after a complete message is consumed (line 847). - Every peer connection that reads even a single byte triggers a fresh 2 MB humongous allocation.
- With 242+ live threads and active peer networking, this allocation fires continuously.
- The JFR
ObjectAllocationSamplerecords 15,985 samples ofbyte[]originating fromPeer.readChannel()line 783, with a sampled weight of 2.0 MB each — the single largest allocation source in the entire recording (40.7% of all allocation samples arebyte[]).
3.3 Impact chain
Peer connects / sends data
→ byteBuffer == null → allocate 2,097,161 bytes
→ humongous object → lands in Old Gen
→ Old Gen fills → G1 Young GC fires (cause: "G1 Humongous Allocation")
→ STW pause 63 ms avg, up to 5,930 ms max
→ application stalls, connections time out
4. CPU Hotspots
4.1 Java execution samples (8,366 total)
| Thread | Samples | % |
|---|---|---|
Arbitrary Data group (all threads) |
~4,172 | 49.9% |
Synchronizer |
995 | 11.9% |
pool-26-thread-1 |
562 | 6.7% |
AT States trimmer / pool-* |
486 | 5.8% |
BlockMinter |
412 | 4.9% |
pool-16-thread-1 |
382 | 4.6% |
Foreign Fee Manager-1 |
277 | 3.3% |
4.2 Native method samples (17,519 total)
| Thread | Samples | % | Note |
|---|---|---|---|
RMI TCP Connection threads |
2,956 | 16.9% | JMX/diagnostic traffic |
HttpClient-1-SelectorManager |
1,478 | 8.4% | |
qtp406511188-272/273/274 |
~4,431 | 25.3% | Jetty HTTP threads (SSL) |
grpc-nio-worker-ELG-1-2/3/4 |
~4,431 | 25.3% | gRPC NIO workers |
Network-IO |
1,320 | 7.5% | |
NetworkData-IO |
1,247 | 7.1% | |
Arbitrary Data threads |
570 | 3.3% |
The Jetty + gRPC NIO workers dominate native samples, indicating heavy network I/O in kernel space (epoll/select syscalls).
4.3 Arbitrary Data subsystem — what it is actually doing
The execution samples for all Arbitrary Data threads point to the same HSQLDB call chain:
QuerySpecification.buildResult()
→ RangeVariable$RangeIteratorMain.next()
→ IndexAVL$IndexRowIterator.getNextRow()
→ IndexAVL.next()
→ NodeAVLDisk.getLeft/getRight/getParent()
→ RowStoreAVLDisk.get()
→ DataFileCache.getFromFile() ← disk read under WriteLock
→ DataFileCache.get()
This is a full AVL index scan on a disk-backed HSQLDB table, repeated continuously by multiple threads (Arbitrary Data Manager, Arbitrary Index Cache Timer Task, Arbitrary Data Cleanup Manager, Arbitrary Data Cache Manager). The ReentrantReadWriteLock$WriteLock.lock() on DataFileCache causes cross-thread contention on every disk read.
Additionally, Cache.cleanUp() / Cache.preparePut() appear in samples, meaning the HSQLDB row cache is under constant eviction pressure — it cannot hold the working set in memory.
4.4 Foreign Fee Manager — EdDSA in a hot path
Foreign Fee Manager-1 (277 Java execution samples + 12 native samples) is burning CPU on:
Crypto.verify()
→ Ed25519.implVerify()
→ Ed25519.scalarMultStraus128Var()
→ Wnaf.getSignedVar()
Ed25519 signature verification is expensive on aarch64 without hardware acceleration. If processForeignFeesImportQueue() is verifying signatures in a tight scheduler loop, it should batch or throttle.
5. Object Allocation Profile
5.1 Allocation sample type breakdown (85,503 total samples)
| Type | Samples | % | Primary source |
|---|---|---|---|
byte[] |
34,837 | 40.7% | Peer.readChannel() (2 MB buffers) + HSQLDB binary reads |
NodeAVLDisk |
6,278 | 7.3% | HSQLDB index node reads |
Peer$PreSerializedMessageWrapper |
5,013 | 5.9% | Outbound peer message serialization |
AQS$ConditionNode |
4,505 | 5.3% | Lock condition queue churn |
Object[] |
3,974 | 4.6% | General |
int[] |
3,554 | 4.2% | HSQLDB / general |
RowAVLDisk |
1,890 | 2.2% | HSQLDB row reads |
BinaryData |
1,687 | 2.0% | HSQLDB binary column data |
Long |
1,510 | 1.8% | Autoboxing |
5.2 Promotion to Old Gen (PromoteObjectInNewPLAB)
- 66,987 total promote events; 60,064 (89.7%) tenured = true.
- Nearly 90% of promoted objects survive to Old Gen, confirming the working set is too large for Young Gen to absorb.
plabSize = 1.0 MBappears 4,989 times — large 1 MB PLAB chunks being allocated in Old Gen forbyte[],BinaryData, andNodeAVLDiskpromotions.
5.3 Old Object Sample (741 events — long-lived objects)
All 741 OldObjectSample events record object = [ (arrays), consistent with the 2 MB byte[] buffers accumulating in Old Gen between concurrent GC cycles.
6. Memory State
| Pool | Used | Committed | Max |
|---|---|---|---|
| G1 Eden Space | 64 MB | 360 MB | dynamic |
| G1 Old Gen | 1.19 GB | 1.54 GB | 3.64 GB |
| G1 Survivor Space | 44 MB | 44 MB | dynamic |
| Metaspace | 76 MB | — | unlimited |
| CodeHeap (all) | 28.8 MB | — | — |
- Old Gen grew from 897 MB → 1.19 GB (+300 MB) in ~3 minutes of the live observation window.
- G1 Eden has 360 MB committed but only 64 MB used — G1 is not utilising Eden efficiently because most large allocations skip it entirely (humongous path).
- Host free physical memory: 74.7 MB of 7.28 GB. JVM committed virtual memory (7.52 GB) exceeds total RAM — the OS is paging.
7. Thread State Snapshot
| State | Count |
|---|---|
| RUNNABLE | 20 |
| WAITING | 124 |
| TIMED_WAITING | 98 |
| Live total | 242 (peak 270, 860 started) |
- No deadlocks.
- No blocked threads.
- The high WAITING + TIMED_WAITING count with non-trivial CPU burn (seen in the JMX sample) is explained by threads that loop through work, block briefly on a lock or park, and immediately resume — appearing "waiting" in a point-in-time snapshot but consuming real CPU over time.
8. Summary of Findings
| # | Finding | Severity |
|---|---|---|
| 1 | Peer.readChannel() allocates a 2 MB ByteBuffer per message cycle; humongous threshold is 1 MB → 97.7% of GCs triggered by this |
Critical |
| 2 | Max GC pause of 5.93 seconds; 8% of runtime in STW pauses | Critical |
| 3 | Old Gen growing ~100 MB/min; Full GC imminent if growth continues | High |
| 4 | Physical RAM exhausted (74 MB free); JVM committed memory exceeds RAM → OS paging | High |
| 5 | Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB | High |
| 6 | HSQLDB DataFileCache under constant WriteLock contention across 4+ threads |
Medium |
| 7 | Foreign Fee Manager verifies Ed25519 signatures in a tight scheduler loop |
Medium |
| 8 | 89.7% of Young Gen promotions go directly to Old Gen — Young Gen is not filtering effectively | Medium |
9. Recommended Fixes
Fix 1 — Eliminate humongous ByteBuffer allocations (Critical)
Option A (JVM flag — immediate mitigation, no code change):
Add -XX:G1HeapRegionSize=4m to the JVM launch flags. This doubles the region size, moving the humongous threshold from 1 MB to 2 MB. The 2,097,161-byte buffers drop below the threshold and allocate normally in Eden.
Option B (code fix — correct solution):
Do not set byteBuffer = null after each message. Keep the buffer allocated for the lifetime of the connection and clear() / compact() it instead. This means one 2 MB allocation per peer connection (not per message), and since it stays in Old Gen only once, it stops triggering repeated GC cycles.
Option C (deeper fix): Allocate a small initial buffer and grow it only when a large message is detected from the header length. Most messages are small; only block messages need 2 MB.
Fix 2 — Reduce HSQLDB full-scan frequency (High)
The Arbitrary Data threads run full IndexAVL scans on every tick. Determine whether the queries driving Arbitrary Data Manager, Arbitrary Index Cache Timer Task, and Arbitrary Data Cleanup Manager have appropriate indexes, and whether their scheduling interval can be increased.
Fix 3 — Throttle Ed25519 verification in Foreign Fee Manager (Medium)
Batch signature verification or add a rate limit / back-pressure to processForeignFeesImportQueue() so it does not monopolize CPU on each scheduler tick.
10. Implementation Status
| # | Finding | Severity | Status | Commit |
|---|---|---|---|---|
| 1 | Peer.readChannel() 2 MB humongous ByteBuffer per message cycle |
Critical | ✅ Fixed | b0895f02 |
| 2 | Max GC pause 5.93 s; 8% of runtime in STW pauses | Critical | ✅ Resolved by #1 | — |
| 3 | Old Gen growing ~100 MB/min; Full GC imminent | High | ✅ Resolved by #1 | — |
| 4 | Physical RAM exhausted; JVM committed memory exceeds RAM → OS paging | High | ✅ Fixed | 4134edb6 |
| 5 | Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB | High | ✅ Fixed | e3de6944 |
| 6 | HSQLDB DataFileCache under constant WriteLock contention across 4+ threads |
Medium | ✅ Resolved by #5 | — |
| 7 | Foreign Fee Manager verifies Ed25519 signatures in a tight scheduler loop |
Medium | ✅ Fixed | 2ef1e3f8 |
| 8 | 89.7% of Young Gen promotions go directly to Old Gen | Medium | ✅ Resolved by #1 | — |