12 KiB
GC Log + JFR Analysis — Full Re-Analysis
Environment: Java 17.0.19, G1GC, 4 CPUs, 7.4GB system RAM, heap 1536M–2984M
GC log window: ~28.9 minutes (1731s), 305 GC events
JFR window: 1327s (22 min), 6 chunks, start 17:36:20
1. GC Summary (gc.log — 305+ collections)
| Phase | Time range | GC count | Notable |
|---|---|---|---|
| Warmup | 0–22s | GC0–GC2 | Heap 84M→23M; first concurrent mark 43ms |
| Ramp | 22–240s | GC3–GC82 | MMU violations begin; 82 Mixed GCs clear old gen; mark cycles 738–910ms |
| Mid | 240–870s | GC83–GC183 | Long concurrent mark (2224ms at GC125); heap grows to 2232M max |
| Heavy mixed | 870–1180s | GC184–GC247 | Mixed GC pauses escalate to 395ms; concurrent mark 2395ms→2738ms |
| Terminal | 1180–1731s | GC248–GC305 | Only Young GCs; heap monotonically fills; old gen 268 regions = 2144MB; heap after GC reaches 2310/2984M |
Key numbers:
- 77 MMU violations (all at 151ms/150ms target) across the entire log — not 10+
- Worst pauses: 395ms (GC246 mixed), 361ms (GC190 mixed), 350ms (GC94 young-normal)
- Concurrent mark times growing: 43ms → 738ms → 910ms → 2224ms → 2395ms → 2738ms — each cycle longer because the live object graph keeps expanding
- Old gen grows monotonically after mixed GCs finish: 0 → 268 regions. Young GC adds 2–4 old regions per collection with nothing ever reclaiming them.
- Heap near saturation at log end: after GC305, heap = 2413M→2310M in a 2984M capacity. Only ~674MB headroom. OOM within minutes of the log ending unless a concurrent mark + mixed GC cycle fires and succeeds.
- Container RSS: JFR ContainerMemoryUsage shows process consuming 3.0GB→4.0GB over the JFR window — far above the 2984M heap max, meaning substantial off-heap (direct buffers, native, metaspace).
2. CPU Hot Spots (ExecutionSample, 25036 samples — all RUNNABLE)
| Rank | Method | Samples | % | Change vs last |
|---|---|---|---|---|
| 1 | hsqldb.map.BaseHashMap.getObjectLookup |
2060 | 8.2% | ≈ same |
| 2 | hsqldb.lib.ObjectComparator.equals |
1752 | 7.0% | +1.7pp ↑ |
| 3 | qortal.utils.Base58.divMod58 |
1744 | 7.0% | +4pp ↑↑ — more than doubled |
| 4 | hsqldb.persist.Cache.cleanUp |
1171 | 4.7% | +1.1pp ↑ |
| 5 | hsqldb.map.BaseHashMap.remove |
1037 | 4.1% | +1.7pp ↑ |
| 6 | AQS.releaseShared |
835 | 3.3% | ≈ same |
| 7 | AQS.acquireShared |
787 | 3.1% | ≈ same |
| 8 | ReentrantReadWriteLock.tryAcquireShared |
478 | 1.9% | ≈ same |
| 9 | hsqldb.map.ValuePoolHashMap.getOrAddInteger |
434 | 1.7% | new |
Base58.divMod58 is now #3 at 7% — it's grown proportionally as other work stays the same, meaning the number of address encode/decode calls is increasing with chain growth. This is the fastest-growing hot spot.
HSQLDB Cache.cleanUp remaining at #4 (now 4.7%) confirms the cache is consistently undersized for the working set.
3. Lock Contention (JavaMonitorEnter, 1175 events)
| Monitor class | Events | Total blocked | Max single wait | Root cause |
|---|---|---|---|---|
NTPUDPClient |
593 | 1,516,245 ms (25.3 min) | 8,170 ms | Shared single instance across all NTP pool threads |
Collections$SynchronizedMap |
347 | 392,007 ms (6.5 min) | 6,100 ms | ArbitraryDataBuildManager build queue map |
HSQLDBRepositoryFactory |
2 | 62,900 ms | 32,400 ms | AT state pruner/trimmer holding factory lock |
EPollSelectorImpl |
173 | 5,067 ms | 144 ms | Network selector contention |
ArrayList |
30 | 3,969 ms | 847 ms | Various |
NTPUDPClient remains the worst single contention point. 25 cumulative minutes of thread blocking in a 22-minute recording — effectively, NTP threads spend more time blocked than running.
New finding — ArbitraryDataBuildManager: A Collections.synchronizedMap is used as the arbitrary data build queue. Multiple threads call isInBuildQueue (containsKey), addToBuildQueue (put), and ArbitraryDataBuilderThread.run (isEmpty) on it simultaneously. Individual waits reach 6 seconds. Call chains:
ArbitraryDataReader.isBuilding()→ArbitraryDataBuildManager.isInBuildQueue()→SynchronizedMap.containsKey()ArbitraryDataBuilderThread.run()→SynchronizedMap.isEmpty()(spun in a busy-wait pattern)
New finding — AT state pruner/trimmer: HSQLDBATRepository.rebuildLatestAtStates() (pruner) and HSQLDBATRepository.trimAtStates() (trimmer) each hold the HSQLDB repository factory lock for 30+ seconds. During this window, all other threads needing a database connection are blocked. This is why long pauses appear in worker threads during AT processing epochs.
4. Allocation Pressure (ObjectAllocationSample, 207668 samples)
| Site | Weight % | Event count | Object type |
|---|---|---|---|
RowAVLDisk.<init> |
21.6% | 21815 | NodeAVLDisk — disk row read, cache miss |
RowInputBinary.readByteArray() |
11.7% | 13577 | byte[] — per-column read |
RowInputBinary.readBinary() |
5.5% | 6085 | BinaryData |
Peer.sendPreSerializedMessage() |
4.3% | 4347 | Peer$PreSerializedMessageWrapper |
StringConverter.readUTF() |
3.6% | 4582 | String/char[] |
RowStoreAVLDisk.get() |
3.5% | 4123 | Row wrappers |
Arrays.copyOf(byte[],int) |
2.9% | 16264 | byte[] |
ByteArrayOutputStream.<init> |
2.7% | 8839 | ByteArrayOutputStream |
HSQLDB dominates allocation at >45% of weight. The cache miss → disk read → object allocation cycle (RowAVLDisk + readByteArray + readBinary) is the single largest allocation source.
5. Memory Retention — Root Cause of Old-Gen Saturation
This is the most important new finding. Three distinct sources are filling old gen and not releasing:
5a. Network IO — 2MB ByteBuffer per read (~416MB retained)
Peer.readChannel() calls ByteBuffer.allocate(2097161) (2MB+) on every invocation of the IO loop. The Network-IO thread has 113 retained 2MB buffers aged 6+ minutes. NetworkData-IO has 95 more. That is 208 × 2MB = 416MB stuck in old gen from network receive buffers alone.
Root cause: readChannel() allocates a fresh 2MB buffer each call instead of reusing a per-peer buffer. The per-call allocation easily escapes to old gen under G1 because it is large enough to be promoted quickly.
Allocation stack:
Peer.readChannel() line: 783
→ ByteBuffer.allocate(2097161)
→ HeapByteBuffer.<init>(int, int) ← allocates 2MB on every call
5b. MemoryPoW — 8MB long arrays per verification (~360MB retained)
MemoryPoW.verify2() allocates long[1048576] (8MB) per call. Thread pool-23-thread-1 has 26 retained and Thread-8 has 19 retained — 45 × 8MB = 360MB in old gen.
These are scratch buffers for PoW verification. If the verifier is called frequently (e.g., on every incoming block or transaction) and each call allocates a fresh 8MB array, the GC cannot collect them fast enough.
5c. Timer-2 growing ArrayList
Object[47427] in Timer-2 thread, aged 7+ minutes, allocated via ArrayList.add(). This is the same unbounded-growth list identified previously — it has not been fixed. The array continues to grow each time the timer fires.
Combined old-gen retention estimate
| Source | Objects | Size each | Total |
|---|---|---|---|
| Network-IO ByteBuffers | 208 | 2MB | ~416MB |
| MemoryPoW long arrays | 45 | 8MB | ~360MB |
| Timer-2 ArrayList | 1 (growing) | ~1.8MB at 47k entries | ~2MB |
| HSQLDB NodeAVLDisk | 12+ observed | variable | >100MB |
| Total identifiable | ~880MB |
This 880MB of systematically unreclaimed live data explains why old gen fills to 2144MB (268 × 8M regions) and mixed GCs cannot reclaim it — it is all genuinely reachable, just never released.
6. Errors (JavaErrorThrow, 90 events)
- 85 ×
NoSuchMethodErroronDirectMethodHandle$Holder— same as before; bytecode manipulation library version mismatch. - 5 ×
IncompatibleClassChangeError— NEW. These indicate a class that was an interface at compile time is now a class at runtime (or vice versa). Suggests two different versions of the same library on the classpath. Can cause silent behavioral bugs, not just noise.
7. G1 Adaptive IHOP (Marking Trigger)
| Time | Occupancy | Threshold | Alloc rate | Mark duration |
|---|---|---|---|---|
| Start (17:36) | 568MB | 1.2GB (73%) | 12.5 MB/s | 5.88s predicted |
| End (17:58) | 2.1GB | 2.3GB (93%) | 4.0 MB/s | 5.42s predicted |
The IHOP threshold drifted from 73% to 93% of target occupancy. G1's adaptive algorithm is raising the threshold because old gen keeps growing — it is trying to delay concurrent mark starts to allow more old-gen to fill, which paradoxically makes each mixed GC cycle more expensive. Allocation rate dropped from 12.5 to 4.0 MB/s (GC is taking a larger share of CPU time).
Recommendations (priority order)
P0 — Stop the 2MB ByteBuffer leak (fixes ~416MB old-gen growth)
Peer.readChannel() must not allocate a new 2MB buffer per call. Fix: allocate one ByteBuffer per Peer instance (or use a pool) and reuse it across calls. The buffer should be cleared/compact()ed at the start of each read, not reallocated.
// Peer.java — allocate once in constructor or as field:
private final ByteBuffer readBuffer = ByteBuffer.allocate(2097161);
// In readChannel():
readBuffer.clear();
channel.read(readBuffer);
P1 — Pool MemoryPoW scratch buffers (fixes ~360MB old-gen growth)
MemoryPoW.verify2() should take the long[] scratch buffer as a parameter or use a thread-local pool. A ThreadLocal<long[]> initialized once per verifier thread costs nothing and eliminates the 8MB per-call allocation:
private static final ThreadLocal<long[]> SCRATCH =
ThreadLocal.withInitial(() -> new long[1048576]);
P2 — Fix NTP lock contention (removes 25min/22min of blocked time)
Give each NTP$NTPServer instance its own NTPUDPClient. The client is not expensive to create and is clearly not designed for sharing. This is unchanged from the previous recommendation but remains the most impactful threading fix.
P3 — Replace SynchronizedMap in ArbitraryDataBuildManager
Replace Collections.synchronizedMap(new HashMap<>()) with ConcurrentHashMap. The current pattern — multiple threads calling isEmpty(), containsKey(), put() under the same coarse lock — serializes all arbitrary data build operations. ConcurrentHashMap is drop-in for these operations and eliminates the 6.5 min contention total.
P4 — Fix AT state pruner/trimmer lock duration
HSQLDBATRepository.rebuildLatestAtStates() and trimAtStates() hold the repository factory lock for 30+ seconds. These batch operations should either:
- Run on a dedicated background connection (not through the shared factory)
- Release and re-acquire the lock periodically (checkpoint approach)
- Run during off-peak windows if the blockchain allows
P5 — Bound the Timer-2 ArrayList
Find the periodic task in Timer-2 that calls ArrayList.add(). Cap its size (e.g., keep the last N entries) or replace it with a ring buffer. It has been growing for 7+ minutes with 47,427 entries and is never cleared.
P6 — G1 tuning for current workload
-XX:G1HeapRegionSize=16m # raise humongous threshold (2MB buffers won't be humongous at 16M region size)
-XX:G1MixedGCCountTarget=16 # spread mixed GC work over more collections
-XX:G1HeapOccupancyPercent=35 # start concurrent mark earlier (IHOP drifted to 93%!)
-XX:MaxGCPauseMillis=200 # match target to reality; 150ms is too tight for this heap size
-Xmx4096m # consider increasing max heap; process RSS already reaches 4GB
The most urgent flag is -XX:G1HeapOccupancyPercent=35 — without it, G1 will keep waiting until 90%+ occupancy before starting concurrent mark, giving mixed GCs no room to work.
P7 — Cache Base58 results
Base58.divMod58 is at 7% CPU and growing. The growth implies encode/decode is called proportionally to chain state queries. A small LRU cache (e.g., Guava CacheBuilder.maximumSize(10000)) at the Base58.encode/decode boundary would flatten this entirely.
P8 — Resolve IncompatibleClassChangeError
5 IncompatibleClassChangeError events indicate a classpath conflict (not just method signature mismatch like the NoSuchMethodErrors). Run with -verbose:class and look for duplicate class definitions loaded by different classloaders. This can cause subtle data corruption, not just noise.