Until now, we required a perfect success rate when syncing with a peer via Synchronizer.syncToPeerChain(). Blocks were requested individually, but the node would give up and lose all progress if a single request failed. In practice, this happened very regularly, and it was difficult to succeed when there were a large number of blocks (e.g. 20+) that needed to be requested.
This commit adds two retry mechanisms, causing each of the two request types (block sigs and blocks) to retry 3 times before giving up, potentially avoiding a lot of wasted work. The number of retries is configurable in the MAXIMUM_RETRIES constant, which we could move to settings at some point if this feature proves useful.
The original issue seemed to result in a few side effects:
1. Nodes would spend a large amount of time requesting blocks from peers, only to throw it all away afterwards. This potentially added to network congestion, as nodes were using unnecessary network time to unproductively serve peers.
2. A large number of sync attempts were failing, particularly when a fork emerged with a significant number of divergent blocks (20+). This issue reduced the ability for nodes to sync to the correct chain while they still had time to do so. With every block that passed, it became made it more and more difficult to switch to the correct chain. Eventually, the correct chain would become TOO_DIVERGENT at which point there is no way to automatically switch without manual intervention. I hope that this retry mechanism will increase the chances of nodes automatically moving onto the right chain quickly, avoiding the need for a user to intervene.
3. The POST /admin/forcesync API was unlikely to succeed when the peer's chain had started to diverge from the user's chain. This should increase the success rate.
Also included in this commit is a MAXIMUM_BLOCK_SIGNATURES_PER_REQUEST constant. This limits the number of block sigs requested in each batch (default 200). Without this, we are unable to increase MAXIMUM_COMMON_DELTA because it can try and request thousands of block sigs at once, which unsurprisingly doesn't succeed.
This bug often prevented the correct amount of block signatures (and blocks) from being requested from a peer, when trying to sync to it.
It could result in quite serious consequences, as it would trigger orphaning back to the common block without first requesting all of the necessary blocks from the peer's chain. Rather than applying a complete copy of the peer's chain, it could orphan back to the common block and then only apply a few blocks beyond that, leaving the node in an unexpected state, potentially hundreds of blocks behind the peer's current height, which it then has to try and obtain from other peers.
When there are forks present, this could result in it hopping from chain to chain, each time being unable to fully synchronise with the peer. Given that we currently discard our chain if it is deemed that our latest block isn't "recent", it is very important that nodes are brought up to the latest block when synchronising with a peer, to avoid constantly triggering discards.
The severity of this bug increased when there was a large disparity between the peer's latest block and the common block height, and prevented us from being able to increase MAXIMUM_COMMON_DELTA.
These are simpler than the level 1+2 tests; they only test that the rewards are correct for each level post-shareBinFix. I don't think we need multiple instances of the pre-shareBinFix or block orphaning tests. There are a few subtle differences between each test, such as the online status of Bob, in order the make the tests slightly more comprehensive.
1. Assign 3 minters (one founder, one level 1, one level 2)
2. Mint a block after the shareBinFix, ensuring that level 1 and 2 are being rewarded evenly from the same share bin.
3. Orphan the block and ensure the rewards are reversed.
4. Orphan two more blocks, each time checking that the balances are being reduced in accordance with the pre-shareBinFix mapping.
As importing a transaction requires blockchain lock, all the network threads
can be used up blocking for that lock, especially if Synchronizer is active.
So we simply discard incoming TRANSACTION messages if we can't immediately
obtain the blockchain lock. Some other peer will probably attempt to
send the transaction soon again anyway.
Plus we swap transaction lists after connection handshake.
Post trigger, account levels will map correctly to share bins, subtracting 1 to account for the 0th element of the shareBinsByLevel array.
Pre-trigger, the legacy mapping will remain in effect.
Post trigger, this change will use all 128 bytes of previous block's signature when
calculating/validating next block's "minter" signature (itself the first 64 bytes of a block signature).
Prior to trigger, current behaviour is to only use first 64 bytes of previous block's
signature, which doesn't encompass transactions signature.
New block sig code should help reduce forking and help improve transactional
security.
Added "newBlockSigHeight" to blockchain.json but initially set to block 999999
pending decision on when to merge, auto-update, go-live, etc.
Symptoms of a CHECKPOINT-related DB deadlock:
On Controller thread:
"Controller" #20 prio=5 os_prio=31 cpu=1577665.56ms elapsed=17666.97s allocated=475G defined_classes=412 tid=0x00007fe99f97b000 nid=0x1644b waiting on condition [0x0000700009a21000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@14.0.2/Native Method)
- parking to wait for <0x0000000602f2a6f8> (a org.hsqldb.lib.CountUpDownLatch$Sync)
[...some more lines...]
[this next line is the best indicator: ]
at org.qortal.repository.hsqldb.HSQLDBRepository.checkpoint(HSQLDBRepository.java:385)
at org.qortal.repository.RepositoryManager.checkpoint(RepositoryManager.java:51)
at org.qortal.controller.Controller.run(Controller.java:544)
Other threads stuck at:
- parking to wait for <0x00000007ff09f0b0> (a org.hsqldb.lib.CountUpDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@14.0.2/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@14.0.2/AbstractQueuedSynchronizer.java:714)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@14.0.2/AbstractQueuedSynchronizer.java:1046)
at org.hsqldb.lib.CountUpDownLatch.await(Unknown Source)
at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
Could have affected:
Controller.deleteExpiredTransactions()
Network.getConnectablePeer()
Network.opportunisticMergePeers()
Network.prunePeers()
Symptoms:
2021-02-12 16:46:06 WARN NetworkProcessor:152 - [1556] exception while trying to produce task
java.lang.NullPointerException: null
at org.qortal.repository.hsqldb.HSQLDBRepository.<init>(HSQLDBRepository.java:92) ~[qortal.jar:1.4.1]
at org.qortal.repository.hsqldb.HSQLDBRepositoryFactory.tryRepository(HSQLDBRepositoryFactory.java:97) ~[qortal.jar:1.4.1]
at org.qortal.repository.RepositoryManager.tryRepository(RepositoryManager.java:33) ~[qortal.jar:1.4.1]
at org.qortal.network.Network.getConnectablePeer(Network.java:525) ~[qortal.jar:1.4.1]
This is a serious change as it affects many callers.
Controller.onNetworkTransactionMessager()
-- should be OK to block as it's only one EPC thread
TransactionsResource.processTransaction()
-- should be OK to block as it's only one Jetty thread
AND has its own 30s timeout wrapper anyway
Implementations of AcctTradeBot.progress()
e.g. BitcoinACCTv1TradeBot.progress() & LitecoinACCTv1TradeBot.progress()
TradeBot.updatePresence()
-- these are called via BlockMinter/Synchronizer
when blockchain lock is already held, so will are unaffected
AcctTradeBot.createTrade() and AcctTradeBot.startResponse()
-- these are called via API
and previously would only perform non-blocking blockchainLock.try()
but now perform blocking blockchainLock.lock()
thus preventing NO_BLOCKCHAIN_LOCK when creating/responding to trades
especially after a (wasted) MESSAGE PoW compute
but with potential downside that API response might be delayed
e.g. by a very slow sync round
Future work could look into removing the need for blockchain lock
when calling Transaction.importAsUnconfirmed().
This table was previously defined using the TEMPORARY keyword as the rows were used as a cached/index to speed up AT trimming SQL statements.
However, the table definition was lacking the "ON COMMIT PRESERVE ROWS" clause, and possibly the "GLOBAL" keyword, which caused the table contents to be emptied, immediately after being filled, when <tt>repository.saveChanges()</tt> was called (i.e. "COMMIT").
This caused latest AT states to be trimmed in error.
AtRepositoryTests.testGetLatestATStatePostTrimming also fixed so that it fails with the previous, broken code.
This table was previously defined using the TEMPORARY keyword as the rows were used as a cached/index to speed up AT trimming SQL statements.
However, the table definition was lacking the "ON COMMIT PRESERVE ROWS" clause, and possibly the "GLOBAL" keyword, which caused the table contents to be emptied, immediately after being filled, when <tt>repository.saveChanges()</tt> was called (i.e. "COMMIT").
This caused latest AT states to be trimmed in error.
AtRepositoryTests.testGetLatestATStatePostTrimming also fixed so that it fails with the previous, broken code.
According to Bitcoin source, CheckFinalTx() in validation.cpp ~line 223,
we need to make sure median blocktime has passed P2SH refund transaction's
nLockTime.
Previously we were erroneously checking that median blocktime was in the past.
This should fix issues where refunding P2SH results in a "non-final" error
from the ElectrumX server network.
PRESENCE transactions were previously validated using Bob's trade key (in address form).
But as PRESENCE transactions are already emitted by Alice, her trade key is also used
(if present in trade data by virtue of AT being locked to Alice).
Similarly, Alice's trade-bot won't even try to build PRESENCE transactions if her
trade key isn't publicly visible to other peers, i.e. after AT is locked to Alice.
This aids matching PRESENCE to corresponding trade offers for use in UI.
Also tighten up visibility of some fields in ChainChainOfferSummary and
PresenceInfo to private.
PresenceInfo.address should map to CrossChainOfferSummary.qortalCreatorTradeAddress
which is "AT creator's ephemeral trading key-pair represented as Qortal address"
Add caching of transactions fetched via ElectrumX to reduce network load and speed up API response.
Fix handling ElectrumX servers that don't want to supply verbose transaction JSON.
Hide lots of data in BitcoinyTransaction that isn't needed by current API users.
Previous fixes for "transaction rollback: serialization failure" when updating trim heights
in commits 16397852 and 58ed7205 had the right idea but were broken due to being synchronized
on different objects.
this.repository.trimHeightsLock would be a new Object() for each repository connection/session
and so not actually synchronize concurrent updates.
Implicit saveChanges()/COMMIT is still needed.
Fix is to use a repository-wide object for synchronization - in this case the repositoryFactory
object as held by RepositoryManager.
Added test to cover.
Also reduced DB trim height read to one call at start of thread for both trimming threads.
Under certain conditions, e.g. non-existent database files, the repository would be created
and then immediately be re-created.
Not only was this unnecessary, but HSQLDBDatabaseUpdates would attempt to export the node-local
data twice, which would cause an error due to existing .script files.
The fix is three-pronged:
1. Don't immediately rebuild the repository if it's only just been built
2. Don't export the empty node-local data if repository has only just been built
3. Don't export the node-local data if it's empty
This involves a database reshape, but before this happens the node-local
data is exported to local files, giving the user the option to use a
bootstrap file instead of waiting.
apiKey in settings is null by default at this point, for backwards compatibility.
In the future, the Windows installer could generate a UUID for apiKey.
apiKey in settings needs to be at least 8 characters.
API calls in the documentation engine are now marked with an open/closed padlock
to show where API key might be required.
Add support for API key security, where X-API-KEY header must match apiKey from settings
apiKey in settings is null by default at this point, for backwards compatibility.
In the future, the Windows installer could generate a UUID for apiKey.
apiKey in settings needs to be at least 8 characters.
API calls in the documentation engine are now marked with an open/closed padlock
to show where API key might be required.
Checkpointing interval is 1 hour by default, changable in settings via
"repositoryCheckpointInterval"
plus corresponding "showCheckpointNotifications" SysTray flags (off by default).
Added entries to SysTray_en i18n properties, and converted SysTray_ru to ISO-8559-1.
Instead of searching from block 0, we now keep a record of
base trim height in the DB itself.
Also, we no longer trim the latest AT state for non-finished ATs
in case they are in deep sleeping and we need their state for when
they awaken.
Symptoms are:
* db/blockchain.log is pretty much exactly 50MB - the checkpoint-triggering size.
* Loads of threads are stuck waiting for HSQLDB's CountUpDownLatch$Sync.await()
* Synchronizer, or some other thread, possibly orphaning blocks.
The cause seems to be method A, which has a repository session,
calls EventBus.INSTANCE.notify() and one of the event listeners
then obtains their own repository session to do repository 'work'.
In the meantime, the HSQLDB log has reached 50MB, triggering auto-checkpoint.
HSQLDB attempts to CHECKPOINT, but waits for existing transactions
to complete, and also blocks starting new transactions.
Thus, one of the event listeners is blocked when they try to obtain
a new repository session, but HSQLDB never performs CHECKPOINT
because the event notifier (method A) still has an unfinished
transaction - hence deadlock.
Drop created_when column from ATStates as it never changes
and can be fetched from ATs table.
This takes about 50s on a fast machine.
Correspondingly rebuild height-based index on ATStates.
This takes about 3 minutes on a fast machine.
Modify AT-related repository methods and callers.
Aggressively remove 'old' (> 2 weeks) actual AT
state binary data, leaving only the hash in DB
(for syncing purposes). Seems to keep up with
syncing from another node on localhost.
Additional benefit is to speed up syncing if node has trade-bot entries,
as a batch of blocks processed within 30s will have the same HTLC responses
as neither Bitcoin nor Litecoin blocks are that fast. Once synced, the next
Qortal block (~60s) should pick up any new changes. Generally users will be
synced when using trade-bot anyway.
Also moved bitcoiny.getAddressTransactions(p2shAddress)
into BitcoinyHTLC.findHtlcSecret() so only called if secret,
or lack thereof, is not cached.
Added tests to cover caching.
Several cross-chain API calls moved into separate classes,
although most of the URLs remain roughly the same to provide
backwards compatibility.
API /crosschain/at/build moved into /crosschain/BitcoinACCTv1
Converted DELETE /crosschain/tradeoffer to be ACCT-agnostic.
Changes to ACCT interface, etc. to support above.
Changes applied to other crosschain API calls to make them
independent of Bitcoin/Litecoin.
Corrections to fee calculations and usage in BitcoinACCTv1.
Added new LitecoinACCTv1 trade-bot, using LitecoinACCTv1.
Some minor typo corrections, rename of secretHash to hashOfSecret.
Some more Bitcoin-specific fields deprecated, but values duplicated
from newly-named fields for now.
Lower default fee (10sats/byte) for Litecoin spending transactions.
(Not P2SH fees which are 1000sats).
Changed ApiError INSUFFICIENT_BALANCE HTTP status from 422 to 402
as 422 isn't supported by Jetty?
CrossChainTradeSummary.btcAmount deprecated, use: foreignAmount
Modified pom.xml to generated package-info.java files for classes
inside org.qortal.api.model.** subdirectories.
API support for Litecoin wallet balance and sending LTC.
TradeBotCreateRequest rejigged to use blockchain-agnostic
field names, e.g. bitcoinAmount now foreignAmount,
and added foreignBlockchain field.
The massive API CrossChainResource class has been split into:
CrossChainAtResource: for building TRADE/REDEEM/CANCEL messages
(OFFER missing?)
CrossChainBitcoinResource: for Bitcoin wallet balance/spend
CrossChainLitecoinResource: ditto for Litecoin
CrossChainHtlcResource: for Bitcoiny-HTLC actions like:
deriving P2SH address
checking HTLC status
eventually: building refund/redeem transactions
CrossChainResource: for creating/cancelling/listing trade offers.
CrossChainTradeBotResource: for creating/cancelling trade-bot
entries, including responding to trade offers.
---
Other general trading changes:
TradeBot states are now specific to each individual trade-bot,
e.g. BitcoinACCTv1TradeBot or LitecoinACCTv1TradeBot, etc.
TradeBot states now a combination of int & String, instead of
enums due to above.
Extra columns added to DB TradeBotStates to store
blockchain, which ACCT in use, etc.
---
UNTESTED at this point!
Extracted AcctMode from BitcoinACCTv1.Mode as the values are
common to both Bitcoin/Litecoin ACCTs.
Added test apps for deploy, cancel, trade and redeem of LitecoinACCTv1.
Added altcoinj library as Maven dependency.
Added new Litecoin subclass of Bitcoiny,
with mainnet and testnet ElectrumX server lists.
Added litecoinNet settings variable and getter.
Added LitecoinTests.
Most tests work but testFindHtclSecret()
needs a redeemed HTLC on chain (not yet done).
Added litecoinNet to some test settings files
in resources.
Added Litecoin BuildHTLC, Refund test apps.
Added SendLTC app as Electrum-LTC seems a bit flaky?
So far managed to build HTLC P2SH, fund it and then
refund it!
---
As Bitcoin and Litecoin are both subclasses of Bitcoiny,
could unify some test apps with added Bitcoin/Litecoin
switch as first arg?
Bitcoin/Litecoin common aspects extracted in a "Bitcoiny" common class.
So:
Bitcoin (was BTC) extends Bitcoiny
Litecoin (future code) will also extend Bitcoiny
ElectrumX is now a BitcoinyBlockchainProvider
to allow easier future replacement and also tidier integration.
BTCP2SH is now BitcoinyHTLC as they are generic hash time-locked contracts,
probably Bitcoin/Litecoin agnostic.
BTCACCT is now BitcoinACCTv1, allowing for v2+ and also LitecoinACCTv1, etc.
BitcoinTransaction is now BitcoinyTransaction
as they are pretty much the same in Litecoin.
BitcoinException is now a more generic ForeignBlockchainException.
---
Bitcoiny subclasses instantiate a new BitcoinyBlockchainProvider
when creating their singleton instance. They pass relevant network
details to the BBP, like server lists, genesis block hash, etc.
Bitcoiny.WalletAwareUTXOProvider now only has the one key search mode
that is equivalent to the old REQUEST_MORE_IF_ANY_SPENT.
Tests tidied up.
---
Still to do:
Modifying TradeBot to handle multiple types of ACCTs,
like BitcoinACCTv2, LitecoinACCTv1...
Modifying API to support multiple types of ACCTs.
Actually add Litecoin support.
Build new ACCT without needing P2SH-B if possible.
There's still an existing issue where log entries like this appear:
Unable to trim old online accounts signatures in repository
which is actually caused by:
integrity constraint violation: unique constraint or index violation; SYS_PK_10092 table: BLOCKS
which seems to be a bug in the version of HSQLDB we use.
(Tested using synced-from-scratch DB).
It's not clear what the actual problem is at this point.
It might be possible to switch to v2.5.1 if our recent HSQLDB-related
commits have fixed/worked-around the OOM issues.
Move the inner method from BlockChain to Controller.
Remove blockchain lock as it's not needed because it's not an
HSQLDB "serialization failure" but constraint violation.
Trimming old online accounts signatures limited to batches of 1440
rows to reduce CPU and memory load.
Added separate method to determine status of P2SH transactions,
returning UNFUNDED, FUNDING_IN_PROGRESS, REDEEMED, etc.
Added code to trade-bot to increase robustness. Lots more
changes including unified state change/logging, checking
for existing MESSAGEs, etc.
Added missing websocket methods to silence log noise.
Trade-bot now called per block during synchronization,
instead of per batch, to pick up edge cases where some
potential trade-bot transitions were missed, resulting
in failed trades.
Corresponding changes in Controller, such as notifying
event bus of new block in same thread (thus blocking)
instead of using executor.
Added slightly more robust common block determination
to Synchronizer.
Refactored code in BTC class to use new BitcoinException
rather than simply returning null, with added sub-classes
allowing differentiation between network issues or fund
issues.
Changed BTC.buildSpend to try harder to find UXTOs to
address false "insufficient funds" issues.
Repository change to add index on MessageTransactions
for quicker look-up of trade-related messages.
Reduced reliance on bitcoinj library in BTCP2SH.
Reworked ElectrumX to better detect errors rather than
continuously try more servers to no avail.
Also added genesis block check in case of servers on
different Bitcoin networks.
Now tries to extract upstream bitcoind error codes
and pass those up to caller via exceptions.
Updated list of testnet servers.
MemoryPoW now detects thread interrupt and exits fast.
Moved some non-generic transaction-related repository
methods to their own subclass. For example:
moved TransactionRepository.getMessagesByRecipient
to MessageRepository.getMessagesByParticipants
Updated and added more tests.
Requires node shutdown, lots of time (10s of minutes), spare storage space.
Called via: java -cp qortal.jar org.qortal.RepositoryMaintenance
Not (yet) for general consumption.
Added Qortal-side HSQLDB PreparedStatement cache, hashed
by SQL query string, to reduce re-preparing statements.
(HSQLDB actually does the work in avoiding re-preparing
by comparing its own query-to-statement cache map, but we
need to keep an 'open' statement on our side for this to
happen).
Support added for batched INSERT/UPDATE SQL statements to
update many rows in one call.
Several specific repository calls, e.g. modifyMintedBlockCount
or modifyAssetBalance, now have batch versions that allow
many rows to be updated in one call.
In Block, when distributing block rewards, although we still
build a map of balance changes to apply after all calculations,
this map is now handed off wholesale to the repository to
apply in one (or two) queries, instead of a repository call
per account. The balanceChanges map is now keyed by account
address, as opposed to actual Account.
Also in Block, we try to cache the fetched online reward-shares
(typically in Block.isValid et al) to avoid re-fetching them
later when calculating block rewards.
In addition, actually fetching online reward-shares is no longer
done index-by-index, but the whole array of indexes is passed
wholesale to the repository which then returns the corresponding
reward-shares as a list.
In Block.increaseAccountLevels, blocks minted counts are also
updated in one single repository call, rather than one
repository call per account.
When distributing Block rewards to legacy QORA holders,
all necessary info is fetched from the repository in one hit
instead of two-phases of: 1. fetching eligible QORA holders,
and 2. fetching extra data for that QORA holder as needed.
In addition, updated QORT_FROM_QORA asset balances are done
via one batch repository call, rather than per update.
Symptoms include this in logs:
Unexpected zero effective minter level for reward-share %s - using 1 instead!
This occurs when Synchronizer compares two sub-chains from a common block,
and one of the blocks is signed by a reward-share key that has
subsequently been cancelled.
Although this is catered for, excessive log-spam is emited.
So in addition to demoting the log level from WARN to DEBUG,
more code has been added to try harder to find the actual data needed,
thus preventing the logging in the first place.
New repository transaction search method added to support above,
along with corresponding tests.