Qortal/qortal

Fork 1

mirror of https://github.com/Qortal/qortal.git synced 2026-06-24 06:49:23 +00:00

Files

T

Nic ae5b2f2cee Final implementation status

2026-05-31 14:59:14 +02:00

12 KiB

Raw Permalink Blame History

JFR Analysis — snap-140558.jfr

Host: masked info JVM: OpenJDK 17.0.19, G1GC, max heap 3.64 GB Recording: 2026-05-31 11:59:35 UTC, duration 384 seconds, 4 chunks, 22 MB

1. CPU Load During Recording

Metric	Observed range
Machine total CPU	55 – 96%
JVM user CPU	20 – 50%
JVM system CPU	5 – 15%
System load avg (at snapshot)	6.11 on 4 cores (152% overloaded)

The machine is CPU-saturated throughout the recording. JVM user CPU is consistently high and system CPU (kernel/IO) adds another significant fraction.

2. GC — The Primary Problem

2.1 Overview

Metric	Value
Total GC events	485 in 384 s (~1.26 GC/sec)
Young GC events	248
Old (concurrent) GC events	237
Total stop-the-world pause time	30,514 ms (8% of runtime)
Average pause	62.9 ms
Maximum pause	5,930 ms (5.93 seconds)

8% of application time is spent paused for GC. The 5.93 s max pause is catastrophic for a peer-to-peer node — it drops connections and misses block propagation windows.

2.2 GC Cause Breakdown

Cause	Count	%
G1 Humongous Allocation	474	97.7%
G1 Preventive Collection	9	1.9%
G1 Evacuation Pause	2	0.4%

97.7% of all GC events are triggered by humongous object allocations. This is the smoking gun.

2.3 What Is a Humongous Allocation?

G1GC divides the heap into equal-sized regions. Any single allocation ≥ 50% of the region size is a "humongous object" and is handled specially:

It is allocated directly in Old Gen (bypasses Eden entirely).
G1 triggers a Young GC immediately to reclaim Old Gen space when humongous objects accumulate.
Humongous objects fragment the heap and are expensive to free.

For a 3.64 GB max heap, G1 uses 2 MB regions → humongous threshold = 1 MB.

3. Root Cause: `Peer.readChannel()` Allocates 2 MB Buffers

3.1 The allocation

src/main/java/org/qortal/network/Peer.java, line 783:

if (this.byteBuffer == null) {
    this.byteBuffer = ByteBuffer.allocate(Network.getInstance().getMaxMessageSize());
}

getMaxMessageSize() is computed once in Network as:

maxMessageSize = 4 + 1 + 4 + BlockChain.getInstance().getMaxBlockSize();

maxBlockSize is set to 2,097,152 bytes in blockchain.json:

"maxBlockSize": 2097152

So every ByteBuffer.allocate call here allocates 2,097,161 bytes ≈ 2.000009 MB — just above the 2 MB region boundary, and double the 1 MB humongous threshold.

3.2 Why this triggers constant GCs

The buffer starts as null (line 720) and is freed (set to null) each time the buffer is empty after a complete message is consumed (line 847).
Every peer connection that reads even a single byte triggers a fresh 2 MB humongous allocation.
With 242+ live threads and active peer networking, this allocation fires continuously.
The JFR ObjectAllocationSample records 15,985 samples of byte[] originating from Peer.readChannel() line 783, with a sampled weight of 2.0 MB each — the single largest allocation source in the entire recording (40.7% of all allocation samples are byte[]).

3.3 Impact chain

Peer connects / sends data
  → byteBuffer == null → allocate 2,097,161 bytes
    → humongous object → lands in Old Gen
      → Old Gen fills → G1 Young GC fires (cause: "G1 Humongous Allocation")
        → STW pause 63 ms avg, up to 5,930 ms max
          → application stalls, connections time out

4. CPU Hotspots

4.1 Java execution samples (8,366 total)

Thread	Samples	%
`Arbitrary Data` group (all threads)	~4,172	49.9%
`Synchronizer`	995	11.9%
`pool-26-thread-1`	562	6.7%
`AT States trimmer` / `pool-*`	486	5.8%
`BlockMinter`	412	4.9%
`pool-16-thread-1`	382	4.6%
`Foreign Fee Manager-1`	277	3.3%

4.2 Native method samples (17,519 total)

Thread	Samples	%	Note
`RMI TCP Connection` threads	2,956	16.9%	JMX/diagnostic traffic
`HttpClient-1-SelectorManager`	1,478	8.4%
`qtp406511188-272/273/274`	~4,431	25.3%	Jetty HTTP threads (SSL)
`grpc-nio-worker-ELG-1-2/3/4`	~4,431	25.3%	gRPC NIO workers
`Network-IO`	1,320	7.5%
`NetworkData-IO`	1,247	7.1%
`Arbitrary Data` threads	570	3.3%

The Jetty + gRPC NIO workers dominate native samples, indicating heavy network I/O in kernel space (epoll/select syscalls).

4.3 Arbitrary Data subsystem — what it is actually doing

The execution samples for all Arbitrary Data threads point to the same HSQLDB call chain:

QuerySpecification.buildResult()
  → RangeVariable$RangeIteratorMain.next()
    → IndexAVL$IndexRowIterator.getNextRow()
      → IndexAVL.next()
        → NodeAVLDisk.getLeft/getRight/getParent()
          → RowStoreAVLDisk.get()
            → DataFileCache.getFromFile()   ← disk read under WriteLock
              → DataFileCache.get()

This is a full AVL index scan on a disk-backed HSQLDB table, repeated continuously by multiple threads (Arbitrary Data Manager, Arbitrary Index Cache Timer Task, Arbitrary Data Cleanup Manager, Arbitrary Data Cache Manager). The ReentrantReadWriteLock$WriteLock.lock() on DataFileCache causes cross-thread contention on every disk read.

Additionally, Cache.cleanUp() / Cache.preparePut() appear in samples, meaning the HSQLDB row cache is under constant eviction pressure — it cannot hold the working set in memory.

4.4 Foreign Fee Manager — EdDSA in a hot path

Foreign Fee Manager-1 (277 Java execution samples + 12 native samples) is burning CPU on:

Crypto.verify()
  → Ed25519.implVerify()
    → Ed25519.scalarMultStraus128Var()
      → Wnaf.getSignedVar()

Ed25519 signature verification is expensive on aarch64 without hardware acceleration. If processForeignFeesImportQueue() is verifying signatures in a tight scheduler loop, it should batch or throttle.

5. Object Allocation Profile

5.1 Allocation sample type breakdown (85,503 total samples)

Type	Samples	%	Primary source
`byte[]`	34,837	40.7%	`Peer.readChannel()` (2 MB buffers) + HSQLDB binary reads
`NodeAVLDisk`	6,278	7.3%	HSQLDB index node reads
`Peer$PreSerializedMessageWrapper`	5,013	5.9%	Outbound peer message serialization
`AQS$ConditionNode`	4,505	5.3%	Lock condition queue churn
`Object[]`	3,974	4.6%	General
`int[]`	3,554	4.2%	HSQLDB / general
`RowAVLDisk`	1,890	2.2%	HSQLDB row reads
`BinaryData`	1,687	2.0%	HSQLDB binary column data
`Long`	1,510	1.8%	Autoboxing

5.2 Promotion to Old Gen (PromoteObjectInNewPLAB)

66,987 total promote events; 60,064 (89.7%) tenured = true.
Nearly 90% of promoted objects survive to Old Gen, confirming the working set is too large for Young Gen to absorb.
plabSize = 1.0 MB appears 4,989 times — large 1 MB PLAB chunks being allocated in Old Gen for byte[], BinaryData, and NodeAVLDisk promotions.

5.3 Old Object Sample (741 events — long-lived objects)

All 741 OldObjectSample events record object = [ (arrays), consistent with the 2 MB byte[] buffers accumulating in Old Gen between concurrent GC cycles.

6. Memory State

Pool	Used	Committed	Max
G1 Eden Space	64 MB	360 MB	dynamic
G1 Old Gen	1.19 GB	1.54 GB	3.64 GB
G1 Survivor Space	44 MB	44 MB	dynamic
Metaspace	76 MB	—	unlimited
CodeHeap (all)	28.8 MB	—	—

Old Gen grew from 897 MB → 1.19 GB (+300 MB) in ~3 minutes of the live observation window.
G1 Eden has 360 MB committed but only 64 MB used — G1 is not utilising Eden efficiently because most large allocations skip it entirely (humongous path).
Host free physical memory: 74.7 MB of 7.28 GB. JVM committed virtual memory (7.52 GB) exceeds total RAM — the OS is paging.

7. Thread State Snapshot

State	Count
RUNNABLE	20
WAITING	124
TIMED_WAITING	98
Live total	242 (peak 270, 860 started)

No deadlocks.
No blocked threads.
The high WAITING + TIMED_WAITING count with non-trivial CPU burn (seen in the JMX sample) is explained by threads that loop through work, block briefly on a lock or park, and immediately resume — appearing "waiting" in a point-in-time snapshot but consuming real CPU over time.

8. Summary of Findings

#	Finding	Severity
1	`Peer.readChannel()` allocates a 2 MB `ByteBuffer` per message cycle; humongous threshold is 1 MB → 97.7% of GCs triggered by this	Critical
2	Max GC pause of 5.93 seconds; 8% of runtime in STW pauses	Critical
3	Old Gen growing ~100 MB/min; Full GC imminent if growth continues	High
4	Physical RAM exhausted (74 MB free); JVM committed memory exceeds RAM → OS paging	High
5	Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB	High
6	HSQLDB `DataFileCache` under constant `WriteLock` contention across 4+ threads	Medium
7	`Foreign Fee Manager` verifies Ed25519 signatures in a tight scheduler loop	Medium
8	89.7% of Young Gen promotions go directly to Old Gen — Young Gen is not filtering effectively	Medium

9. Recommended Fixes

Fix 1 — Eliminate humongous `ByteBuffer` allocations (Critical)

Option A (JVM flag — immediate mitigation, no code change): Add -XX:G1HeapRegionSize=4m to the JVM launch flags. This doubles the region size, moving the humongous threshold from 1 MB to 2 MB. The 2,097,161-byte buffers drop below the threshold and allocate normally in Eden.

Option B (code fix — correct solution): Do not set byteBuffer = null after each message. Keep the buffer allocated for the lifetime of the connection and clear() / compact() it instead. This means one 2 MB allocation per peer connection (not per message), and since it stays in Old Gen only once, it stops triggering repeated GC cycles.

Option C (deeper fix): Allocate a small initial buffer and grow it only when a large message is detected from the header length. Most messages are small; only block messages need 2 MB.

Fix 2 — Reduce HSQLDB full-scan frequency (High)

The Arbitrary Data threads run full IndexAVL scans on every tick. Determine whether the queries driving Arbitrary Data Manager, Arbitrary Index Cache Timer Task, and Arbitrary Data Cleanup Manager have appropriate indexes, and whether their scheduling interval can be increased.

Fix 3 — Throttle Ed25519 verification in Foreign Fee Manager (Medium)

Batch signature verification or add a rate limit / back-pressure to processForeignFeesImportQueue() so it does not monopolize CPU on each scheduler tick.

10. Implementation Status

#	Finding	Severity	Status	Commit
1	`Peer.readChannel()` 2 MB humongous `ByteBuffer` per message cycle	Critical	✅ Fixed	`b0895f02`
2	Max GC pause 5.93 s; 8% of runtime in STW pauses	Critical	✅ Resolved by #1	—
3	Old Gen growing ~100 MB/min; Full GC imminent	High	✅ Resolved by #1	—
4	Physical RAM exhausted; JVM committed memory exceeds RAM → OS paging	High	✅ Fixed	`4134edb6`
5	Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB	High	✅ Fixed	`e3de6944`
6	HSQLDB `DataFileCache` under constant `WriteLock` contention across 4+ threads	Medium	✅ Resolved by #5	—
7	`Foreign Fee Manager` verifies Ed25519 signatures in a tight scheduler loop	Medium	✅ Fixed	`2ef1e3f8`
8	89.7% of Young Gen promotions go directly to Old Gen	Medium	✅ Resolved by #1	—

12 KiB Raw Permalink Blame History Unescape Escape