mirror of
https://github.com/Qortal/qortal.git
synced 2026-06-25 23:39:23 +00:00
276 lines
12 KiB
Markdown
276 lines
12 KiB
Markdown
# JFR Analysis — snap-140558.jfr
|
||
|
||
**Host:** masked info
|
||
**JVM:** OpenJDK 17.0.19, G1GC, max heap 3.64 GB
|
||
**Recording:** 2026-05-31 11:59:35 UTC, duration **384 seconds**, 4 chunks, 22 MB
|
||
|
||
---
|
||
|
||
## 1. CPU Load During Recording
|
||
|
||
| Metric | Observed range |
|
||
|---|---|
|
||
| Machine total CPU | 55 – 96% |
|
||
| JVM user CPU | 20 – 50% |
|
||
| JVM system CPU | 5 – 15% |
|
||
| System load avg (at snapshot) | **6.11 on 4 cores** (152% overloaded) |
|
||
|
||
The machine is CPU-saturated throughout the recording. JVM user CPU is consistently high and system CPU (kernel/IO) adds another significant fraction.
|
||
|
||
---
|
||
|
||
## 2. GC — The Primary Problem
|
||
|
||
### 2.1 Overview
|
||
|
||
| Metric | Value |
|
||
|---|---|
|
||
| Total GC events | **485** in 384 s (~1.26 GC/sec) |
|
||
| Young GC events | 248 |
|
||
| Old (concurrent) GC events | 237 |
|
||
| Total stop-the-world pause time | **30,514 ms** (8% of runtime) |
|
||
| Average pause | **62.9 ms** |
|
||
| **Maximum pause** | **5,930 ms** (5.93 seconds) |
|
||
|
||
8% of application time is spent paused for GC. The 5.93 s max pause is catastrophic for a peer-to-peer node — it drops connections and misses block propagation windows.
|
||
|
||
### 2.2 GC Cause Breakdown
|
||
|
||
| Cause | Count | % |
|
||
|---|---|---|
|
||
| **G1 Humongous Allocation** | **474** | **97.7%** |
|
||
| G1 Preventive Collection | 9 | 1.9% |
|
||
| G1 Evacuation Pause | 2 | 0.4% |
|
||
|
||
**97.7% of all GC events are triggered by humongous object allocations.** This is the smoking gun.
|
||
|
||
### 2.3 What Is a Humongous Allocation?
|
||
|
||
G1GC divides the heap into equal-sized regions. Any single allocation ≥ 50% of the region size is a "humongous object" and is handled specially:
|
||
|
||
- It is allocated directly in Old Gen (bypasses Eden entirely).
|
||
- G1 triggers a Young GC immediately to reclaim Old Gen space when humongous objects accumulate.
|
||
- Humongous objects fragment the heap and are expensive to free.
|
||
|
||
For a 3.64 GB max heap, G1 uses **2 MB regions** → humongous threshold = **1 MB**.
|
||
|
||
---
|
||
|
||
## 3. Root Cause: `Peer.readChannel()` Allocates 2 MB Buffers
|
||
|
||
### 3.1 The allocation
|
||
|
||
`src/main/java/org/qortal/network/Peer.java`, line 783:
|
||
|
||
```java
|
||
if (this.byteBuffer == null) {
|
||
this.byteBuffer = ByteBuffer.allocate(Network.getInstance().getMaxMessageSize());
|
||
}
|
||
```
|
||
|
||
`getMaxMessageSize()` is computed once in `Network` as:
|
||
|
||
```java
|
||
maxMessageSize = 4 + 1 + 4 + BlockChain.getInstance().getMaxBlockSize();
|
||
```
|
||
|
||
`maxBlockSize` is set to **2,097,152** bytes in `blockchain.json`:
|
||
|
||
```json
|
||
"maxBlockSize": 2097152
|
||
```
|
||
|
||
So every `ByteBuffer.allocate` call here allocates **2,097,161 bytes ≈ 2.000009 MB** — just above the 2 MB region boundary, and **double** the 1 MB humongous threshold.
|
||
|
||
### 3.2 Why this triggers constant GCs
|
||
|
||
- The buffer starts as `null` (line 720) and is freed (set to `null`) each time the buffer is empty after a complete message is consumed (line 847).
|
||
- Every peer connection that reads even a single byte triggers a fresh 2 MB humongous allocation.
|
||
- With 242+ live threads and active peer networking, this allocation fires continuously.
|
||
- The JFR `ObjectAllocationSample` records **15,985 samples** of `byte[]` originating from `Peer.readChannel()` line 783, with a sampled weight of **2.0 MB each** — the single largest allocation source in the entire recording (40.7% of all allocation samples are `byte[]`).
|
||
|
||
### 3.3 Impact chain
|
||
|
||
```
|
||
Peer connects / sends data
|
||
→ byteBuffer == null → allocate 2,097,161 bytes
|
||
→ humongous object → lands in Old Gen
|
||
→ Old Gen fills → G1 Young GC fires (cause: "G1 Humongous Allocation")
|
||
→ STW pause 63 ms avg, up to 5,930 ms max
|
||
→ application stalls, connections time out
|
||
```
|
||
|
||
---
|
||
|
||
## 4. CPU Hotspots
|
||
|
||
### 4.1 Java execution samples (8,366 total)
|
||
|
||
| Thread | Samples | % |
|
||
|---|---|---|
|
||
| `Arbitrary Data` group (all threads) | ~4,172 | **49.9%** |
|
||
| `Synchronizer` | 995 | 11.9% |
|
||
| `pool-26-thread-1` | 562 | 6.7% |
|
||
| `AT States trimmer` / `pool-*` | 486 | 5.8% |
|
||
| `BlockMinter` | 412 | 4.9% |
|
||
| `pool-16-thread-1` | 382 | 4.6% |
|
||
| `Foreign Fee Manager-1` | 277 | 3.3% |
|
||
|
||
### 4.2 Native method samples (17,519 total)
|
||
|
||
| Thread | Samples | % | Note |
|
||
|---|---|---|---|
|
||
| `RMI TCP Connection` threads | 2,956 | 16.9% | JMX/diagnostic traffic |
|
||
| `HttpClient-1-SelectorManager` | 1,478 | 8.4% | |
|
||
| `qtp406511188-272/273/274` | ~4,431 | 25.3% | Jetty HTTP threads (SSL) |
|
||
| `grpc-nio-worker-ELG-1-2/3/4` | ~4,431 | 25.3% | gRPC NIO workers |
|
||
| `Network-IO` | 1,320 | 7.5% | |
|
||
| `NetworkData-IO` | 1,247 | 7.1% | |
|
||
| `Arbitrary Data` threads | 570 | 3.3% | |
|
||
|
||
The Jetty + gRPC NIO workers dominate native samples, indicating heavy network I/O in kernel space (epoll/select syscalls).
|
||
|
||
### 4.3 Arbitrary Data subsystem — what it is actually doing
|
||
|
||
The execution samples for all Arbitrary Data threads point to the same HSQLDB call chain:
|
||
|
||
```
|
||
QuerySpecification.buildResult()
|
||
→ RangeVariable$RangeIteratorMain.next()
|
||
→ IndexAVL$IndexRowIterator.getNextRow()
|
||
→ IndexAVL.next()
|
||
→ NodeAVLDisk.getLeft/getRight/getParent()
|
||
→ RowStoreAVLDisk.get()
|
||
→ DataFileCache.getFromFile() ← disk read under WriteLock
|
||
→ DataFileCache.get()
|
||
```
|
||
|
||
This is a **full AVL index scan on a disk-backed HSQLDB table**, repeated continuously by multiple threads (`Arbitrary Data Manager`, `Arbitrary Index Cache Timer Task`, `Arbitrary Data Cleanup Manager`, `Arbitrary Data Cache Manager`). The `ReentrantReadWriteLock$WriteLock.lock()` on `DataFileCache` causes cross-thread contention on every disk read.
|
||
|
||
Additionally, `Cache.cleanUp()` / `Cache.preparePut()` appear in samples, meaning the HSQLDB row cache is under constant eviction pressure — it cannot hold the working set in memory.
|
||
|
||
### 4.4 Foreign Fee Manager — EdDSA in a hot path
|
||
|
||
`Foreign Fee Manager-1` (277 Java execution samples + 12 native samples) is burning CPU on:
|
||
|
||
```
|
||
Crypto.verify()
|
||
→ Ed25519.implVerify()
|
||
→ Ed25519.scalarMultStraus128Var()
|
||
→ Wnaf.getSignedVar()
|
||
```
|
||
|
||
Ed25519 signature verification is expensive on aarch64 without hardware acceleration. If `processForeignFeesImportQueue()` is verifying signatures in a tight scheduler loop, it should batch or throttle.
|
||
|
||
---
|
||
|
||
## 5. Object Allocation Profile
|
||
|
||
### 5.1 Allocation sample type breakdown (85,503 total samples)
|
||
|
||
| Type | Samples | % | Primary source |
|
||
|---|---|---|---|
|
||
| `byte[]` | **34,837** | **40.7%** | `Peer.readChannel()` (2 MB buffers) + HSQLDB binary reads |
|
||
| `NodeAVLDisk` | 6,278 | 7.3% | HSQLDB index node reads |
|
||
| `Peer$PreSerializedMessageWrapper` | 5,013 | 5.9% | Outbound peer message serialization |
|
||
| `AQS$ConditionNode` | 4,505 | 5.3% | Lock condition queue churn |
|
||
| `Object[]` | 3,974 | 4.6% | General |
|
||
| `int[]` | 3,554 | 4.2% | HSQLDB / general |
|
||
| `RowAVLDisk` | 1,890 | 2.2% | HSQLDB row reads |
|
||
| `BinaryData` | 1,687 | 2.0% | HSQLDB binary column data |
|
||
| `Long` | 1,510 | 1.8% | Autoboxing |
|
||
|
||
### 5.2 Promotion to Old Gen (PromoteObjectInNewPLAB)
|
||
|
||
- **66,987** total promote events; **60,064 (89.7%) tenured = true**.
|
||
- Nearly 90% of promoted objects survive to Old Gen, confirming the working set is too large for Young Gen to absorb.
|
||
- `plabSize = 1.0 MB` appears 4,989 times — large 1 MB PLAB chunks being allocated in Old Gen for `byte[]`, `BinaryData`, and `NodeAVLDisk` promotions.
|
||
|
||
### 5.3 Old Object Sample (741 events — long-lived objects)
|
||
|
||
All 741 OldObjectSample events record `object = [` (arrays), consistent with the 2 MB `byte[]` buffers accumulating in Old Gen between concurrent GC cycles.
|
||
|
||
---
|
||
|
||
## 6. Memory State
|
||
|
||
| Pool | Used | Committed | Max |
|
||
|---|---|---|---|
|
||
| G1 Eden Space | 64 MB | 360 MB | dynamic |
|
||
| G1 Old Gen | 1.19 GB | 1.54 GB | 3.64 GB |
|
||
| G1 Survivor Space | 44 MB | 44 MB | dynamic |
|
||
| Metaspace | 76 MB | — | unlimited |
|
||
| CodeHeap (all) | 28.8 MB | — | — |
|
||
|
||
- **Old Gen grew from 897 MB → 1.19 GB** (+300 MB) in ~3 minutes of the live observation window.
|
||
- G1 Eden has 360 MB committed but only 64 MB used — G1 is not utilising Eden efficiently because most large allocations skip it entirely (humongous path).
|
||
- Host free physical memory: **74.7 MB of 7.28 GB**. JVM committed virtual memory (7.52 GB) exceeds total RAM — the OS is paging.
|
||
|
||
---
|
||
|
||
## 7. Thread State Snapshot
|
||
|
||
| State | Count |
|
||
|---|---|
|
||
| RUNNABLE | 20 |
|
||
| WAITING | 124 |
|
||
| TIMED_WAITING | 98 |
|
||
| Live total | 242 (peak 270, 860 started) |
|
||
|
||
- No deadlocks.
|
||
- No blocked threads.
|
||
- The high WAITING + TIMED_WAITING count with non-trivial CPU burn (seen in the JMX sample) is explained by threads that loop through work, block briefly on a lock or park, and immediately resume — appearing "waiting" in a point-in-time snapshot but consuming real CPU over time.
|
||
|
||
---
|
||
|
||
## 8. Summary of Findings
|
||
|
||
| # | Finding | Severity |
|
||
|---|---|---|
|
||
| 1 | `Peer.readChannel()` allocates a 2 MB `ByteBuffer` per message cycle; humongous threshold is 1 MB → 97.7% of GCs triggered by this | **Critical** |
|
||
| 2 | Max GC pause of **5.93 seconds**; 8% of runtime in STW pauses | **Critical** |
|
||
| 3 | Old Gen growing ~100 MB/min; Full GC imminent if growth continues | **High** |
|
||
| 4 | Physical RAM exhausted (74 MB free); JVM committed memory exceeds RAM → OS paging | **High** |
|
||
| 5 | Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB | **High** |
|
||
| 6 | HSQLDB `DataFileCache` under constant `WriteLock` contention across 4+ threads | **Medium** |
|
||
| 7 | `Foreign Fee Manager` verifies Ed25519 signatures in a tight scheduler loop | **Medium** |
|
||
| 8 | 89.7% of Young Gen promotions go directly to Old Gen — Young Gen is not filtering effectively | **Medium** |
|
||
|
||
---
|
||
|
||
## 9. Recommended Fixes
|
||
|
||
### Fix 1 — Eliminate humongous `ByteBuffer` allocations (Critical)
|
||
|
||
**Option A (JVM flag — immediate mitigation, no code change):**
|
||
Add `-XX:G1HeapRegionSize=4m` to the JVM launch flags. This doubles the region size, moving the humongous threshold from 1 MB to 2 MB. The 2,097,161-byte buffers drop below the threshold and allocate normally in Eden.
|
||
|
||
**Option B (code fix — correct solution):**
|
||
Do not set `byteBuffer = null` after each message. Keep the buffer allocated for the lifetime of the connection and `clear()` / `compact()` it instead. This means one 2 MB allocation per peer connection (not per message), and since it stays in Old Gen only once, it stops triggering repeated GC cycles.
|
||
|
||
**Option C (deeper fix):**
|
||
Allocate a small initial buffer and grow it only when a large message is detected from the header length. Most messages are small; only block messages need 2 MB.
|
||
|
||
### Fix 2 — Reduce HSQLDB full-scan frequency (High)
|
||
|
||
The Arbitrary Data threads run full `IndexAVL` scans on every tick. Determine whether the queries driving `Arbitrary Data Manager`, `Arbitrary Index Cache Timer Task`, and `Arbitrary Data Cleanup Manager` have appropriate indexes, and whether their scheduling interval can be increased.
|
||
|
||
### Fix 3 — Throttle Ed25519 verification in Foreign Fee Manager (Medium)
|
||
|
||
Batch signature verification or add a rate limit / back-pressure to `processForeignFeesImportQueue()` so it does not monopolize CPU on each scheduler tick.
|
||
|
||
---
|
||
|
||
## 10. Implementation Status
|
||
|
||
| # | Finding | Severity | Status | Commit |
|
||
|---|---|---|---|---|
|
||
| 1 | `Peer.readChannel()` 2 MB humongous `ByteBuffer` per message cycle | **Critical** | ✅ Fixed | `b0895f02` |
|
||
| 2 | Max GC pause 5.93 s; 8% of runtime in STW pauses | **Critical** | ✅ Resolved by #1 | — |
|
||
| 3 | Old Gen growing ~100 MB/min; Full GC imminent | **High** | ✅ Resolved by #1 | — |
|
||
| 4 | Physical RAM exhausted; JVM committed memory exceeds RAM → OS paging | **High** | ✅ Fixed | `4134edb6` |
|
||
| 5 | Arbitrary Data subsystem runs continuous full AVL index scans on disk-backed HSQLDB | **High** | ✅ Fixed | `e3de6944` |
|
||
| 6 | HSQLDB `DataFileCache` under constant `WriteLock` contention across 4+ threads | **Medium** | ✅ Resolved by #5 | — |
|
||
| 7 | `Foreign Fee Manager` verifies Ed25519 signatures in a tight scheduler loop | **Medium** | ✅ Fixed | `2ef1e3f8` |
|
||
| 8 | 89.7% of Young Gen promotions go directly to Old Gen | **Medium** | ✅ Resolved by #1 | — |
|