Currently set to 1, as serialization of the BlocksMessage data on mainnet is too slow to use this for any significant number of blocks right now. Hopefully we can find a way to optimise this process, which will allow us to use this for multiple block syncing.
Until then, sticking with single blocks should still be enough to help solve the network congestion and re-orgs we are seeing, because it gives us the ability to request the next block based on the previous block's signature, which was unavailable using GET_BLOCK. This removes the requirement to fetch all block signatures upfront, and therefore it shouldn't matter if the peer does a partial re-org whilst a node is syncing to it.
If fast syncing is enabled in the settings (by default it's disabled) AND the peer is running at least v1.5.0, then it will route through to a new method which fetches multiple blocks at a time, in a very similar way to Synchronizer.syncToPeerChain().
If fast syncing is disabled in the settings, or we are communicating with a peer on an older version, it will continue to sync blocks individually.
When communicating with a peer that is running at least version 1.5.0, it will now sync multiple blocks at once in Synchronizer.syncToPeerChain(). This allows us to bypass the single block syncing (and retry mechanism), which has proven to be unviable when there are multiple active forks with several blocks in each chain.
For peers below v1.5.0, the logic should remain unaffected and it will continue to sync blocks individually.
Until now, we required a perfect success rate when syncing with a peer via Synchronizer.syncToPeerChain(). Blocks were requested individually, but the node would give up and lose all progress if a single request failed. In practice, this happened very regularly, and it was difficult to succeed when there were a large number of blocks (e.g. 20+) that needed to be requested.
This commit adds two retry mechanisms, causing each of the two request types (block sigs and blocks) to retry 3 times before giving up, potentially avoiding a lot of wasted work. The number of retries is configurable in the MAXIMUM_RETRIES constant, which we could move to settings at some point if this feature proves useful.
The original issue seemed to result in a few side effects:
1. Nodes would spend a large amount of time requesting blocks from peers, only to throw it all away afterwards. This potentially added to network congestion, as nodes were using unnecessary network time to unproductively serve peers.
2. A large number of sync attempts were failing, particularly when a fork emerged with a significant number of divergent blocks (20+). This issue reduced the ability for nodes to sync to the correct chain while they still had time to do so. With every block that passed, it became made it more and more difficult to switch to the correct chain. Eventually, the correct chain would become TOO_DIVERGENT at which point there is no way to automatically switch without manual intervention. I hope that this retry mechanism will increase the chances of nodes automatically moving onto the right chain quickly, avoiding the need for a user to intervene.
3. The POST /admin/forcesync API was unlikely to succeed when the peer's chain had started to diverge from the user's chain. This should increase the success rate.
Also included in this commit is a MAXIMUM_BLOCK_SIGNATURES_PER_REQUEST constant. This limits the number of block sigs requested in each batch (default 200). Without this, we are unable to increase MAXIMUM_COMMON_DELTA because it can try and request thousands of block sigs at once, which unsurprisingly doesn't succeed.
This bug often prevented the correct amount of block signatures (and blocks) from being requested from a peer, when trying to sync to it.
It could result in quite serious consequences, as it would trigger orphaning back to the common block without first requesting all of the necessary blocks from the peer's chain. Rather than applying a complete copy of the peer's chain, it could orphan back to the common block and then only apply a few blocks beyond that, leaving the node in an unexpected state, potentially hundreds of blocks behind the peer's current height, which it then has to try and obtain from other peers.
When there are forks present, this could result in it hopping from chain to chain, each time being unable to fully synchronise with the peer. Given that we currently discard our chain if it is deemed that our latest block isn't "recent", it is very important that nodes are brought up to the latest block when synchronising with a peer, to avoid constantly triggering discards.
The severity of this bug increased when there was a large disparity between the peer's latest block and the common block height, and prevented us from being able to increase MAXIMUM_COMMON_DELTA.
As importing a transaction requires blockchain lock, all the network threads
can be used up blocking for that lock, especially if Synchronizer is active.
So we simply discard incoming TRANSACTION messages if we can't immediately
obtain the blockchain lock. Some other peer will probably attempt to
send the transaction soon again anyway.
Plus we swap transaction lists after connection handshake.
Post trigger, this change will use all 128 bytes of previous block's signature when
calculating/validating next block's "minter" signature (itself the first 64 bytes of a block signature).
Prior to trigger, current behaviour is to only use first 64 bytes of previous block's
signature, which doesn't encompass transactions signature.
New block sig code should help reduce forking and help improve transactional
security.
Added "newBlockSigHeight" to blockchain.json but initially set to block 999999
pending decision on when to merge, auto-update, go-live, etc.
Symptoms of a CHECKPOINT-related DB deadlock:
On Controller thread:
"Controller" #20 prio=5 os_prio=31 cpu=1577665.56ms elapsed=17666.97s allocated=475G defined_classes=412 tid=0x00007fe99f97b000 nid=0x1644b waiting on condition [0x0000700009a21000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@14.0.2/Native Method)
- parking to wait for <0x0000000602f2a6f8> (a org.hsqldb.lib.CountUpDownLatch$Sync)
[...some more lines...]
[this next line is the best indicator: ]
at org.qortal.repository.hsqldb.HSQLDBRepository.checkpoint(HSQLDBRepository.java:385)
at org.qortal.repository.RepositoryManager.checkpoint(RepositoryManager.java:51)
at org.qortal.controller.Controller.run(Controller.java:544)
Other threads stuck at:
- parking to wait for <0x00000007ff09f0b0> (a org.hsqldb.lib.CountUpDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@14.0.2/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@14.0.2/AbstractQueuedSynchronizer.java:714)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@14.0.2/AbstractQueuedSynchronizer.java:1046)
at org.hsqldb.lib.CountUpDownLatch.await(Unknown Source)
at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
Could have affected:
Controller.deleteExpiredTransactions()
Network.getConnectablePeer()
Network.opportunisticMergePeers()
Network.prunePeers()
Symptoms:
2021-02-12 16:46:06 WARN NetworkProcessor:152 - [1556] exception while trying to produce task
java.lang.NullPointerException: null
at org.qortal.repository.hsqldb.HSQLDBRepository.<init>(HSQLDBRepository.java:92) ~[qortal.jar:1.4.1]
at org.qortal.repository.hsqldb.HSQLDBRepositoryFactory.tryRepository(HSQLDBRepositoryFactory.java:97) ~[qortal.jar:1.4.1]
at org.qortal.repository.RepositoryManager.tryRepository(RepositoryManager.java:33) ~[qortal.jar:1.4.1]
at org.qortal.network.Network.getConnectablePeer(Network.java:525) ~[qortal.jar:1.4.1]