Partitioned Inflationary Rewards Distribution
As the number of stake accounts grows, computing and redeeming the stake rewards at the start block of the epoch boundary becomes very expensive. Currently, with 550K stake accounts, stake rewarding already takes more than 10 seconds. This prolonged computation slows down the network and can cause a large number of forks at the epoch boundary, which makes matters even worse.
Instead of computing and crediting stake rewards at the epoch boundary, we will decouple reward computation and reward credit into two phases.
A separate service, "EpochRewardCalculationService", will be created. The
service will listen to a channel for incoming rewards calculation requests and
perform the rewards calculation. For each block that crosses the epoch
boundary, the bank will send a request to the EpochRewardCalculationService.
This marks the start of the reward computation phase.
N-1 -- N -- N+1
  \
   \
    N+2
In the above example, N is the start of the new epoch. Two rewards calculation
requests will be sent out, at slot N and at slot N+2, because both slots cross
the epoch boundary and are on different forks. To avoid repeating a computation
with the same input, a signature is calculated for each computation request
from hash(stake_accounts_data), hash(vote_accounts), and hash(delegation_map).
Duplicate computation requests will be discarded. In the above example, if no
stake or vote accounts change between slot N and slot N+2, the second
computation request will be discarded.
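The de-duplication described above can be sketched as follows. This is a minimal illustration, not the proposed implementation: SHA-256, the dict-of-bytes inputs, and the names `request_signature` and `RewardCalculationRequests` are all assumptions made for the example.

```python
import hashlib


def _digest(items):
    """Hash (key, value) byte pairs in sorted order, so the result does not
    depend on the order in which the accounts are iterated."""
    h = hashlib.sha256()
    for key, value in sorted(items):
        # Length-prefix each field so adjacent fields cannot run together.
        h.update(len(key).to_bytes(8, "little") + key)
        h.update(len(value).to_bytes(8, "little") + value)
    return h.digest()


def request_signature(stake_accounts, vote_accounts, delegation_map):
    """Combine the hashes of the three inputs into one request signature."""
    outer = hashlib.sha256()
    for part in (stake_accounts, vote_accounts, delegation_map):
        outer.update(_digest(part.items()))
    return outer.hexdigest()


class RewardCalculationRequests:
    """Hypothetical stand-in for the service's request de-duplication."""

    def __init__(self):
        self._seen = set()

    def submit(self, stake_accounts, vote_accounts, delegation_map):
        """Return True if the request is new, False if it is a duplicate."""
        sig = request_signature(stake_accounts, vote_accounts, delegation_map)
        if sig in self._seen:
            return False
        self._seen.add(sig)
        return True
```

With this sketch, the request from slot N+2 is dropped whenever its three inputs hash to the same signature as the request from slot N.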
When reaching block height N after the start of the reward computation
phase, the bank starts the second phase, reward credit. In this phase, the bank
first queries the epoch calc service with the request signature to get the
rewards result, which will be represented as a map from
accounts_pubkey->rewards, and then credits the rewards to the stake accounts
over the next M blocks. If the rewards result is not available, the bank will
wait until the results are available.
We call the intervals of the two phases:
(a) calculating interval: [epoch_start, epoch_start+N]
(b) credit interval: [epoch_start+N+1, epoch_start+N+M]
The combined interval [epoch_start, epoch_start+N+M] is called the
rewarding interval.
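As a small sketch of how these intervals divide block heights, the helper below classifies a block height. The function name and the string labels are hypothetical, and the calculating interval is assumed to be [epoch_start, epoch_start+N], i.e. the part of the combined interval that the credit interval does not cover.

```python
def classify_block(block_height, epoch_start, n, m):
    """Return which part of the rewarding interval a block height falls in.

    n is the length parameter of the calculating interval and m the number of
    credit blocks, both as used in the text above.
    """
    if epoch_start <= block_height <= epoch_start + n:
        return "calculating"
    if epoch_start + n + 1 <= block_height <= epoch_start + n + m:
        return "credit"
    return "outside"


def in_rewarding_interval(block_height, epoch_start, n, m):
    """True for any block in [epoch_start, epoch_start + n + m]."""
    return classify_block(block_height, epoch_start, n, m) != "outside"
```

For example, with epoch_start=1000, N=100 and M=11, block 1100 is the last calculating block and blocks 1101 through 1111 are credit blocks.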
N is chosen to be sufficiently large that the background computation should
have completed, and the result of the reward computation be available, by the
end of the calculating interval. N can be fixed, such as 100 (roughly
equivalent to 50 seconds), or chosen as a function of the number of stake
accounts.
During the credit interval, the bank will fetch the reward computation results
from the background thread and start crediting the rewards over the next M
blocks. The idea is to partition the accounts into M partitions; in each block,
the bank credits 1/M of the accounts. The partition is required to be
deterministic for the
current epoch, but must also be random across different epochs. One way to
achieve these properties is to hash the account's pubkey with some epoch
dependent values, sort the results, and divide them into
M bins. The epoch
dependent value can be the epoch number, total rewards for the epoch, the leader
pubkey for the epoch block, etc.
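The hash, sort, and divide scheme above can be sketched as follows. This is one possible reading, not the proposed implementation: SHA-256, the `epoch_seed` parameter (standing in for whichever epoch dependent value is chosen), and the function name are assumptions made for the example.

```python
import hashlib


def partition_stake_accounts(pubkeys, epoch_seed, m):
    """Split pubkeys into m bins of near-equal size.

    Each pubkey (bytes) is hashed together with an epoch dependent seed
    (bytes); the pubkeys are sorted by that hash and cut into m consecutive
    bins. The same inputs always give the same bins, while a different seed
    reshuffles the order.
    """
    ordered = sorted(
        pubkeys, key=lambda pk: hashlib.sha256(epoch_seed + pk).digest()
    )
    base, extra = divmod(len(ordered), m)
    bins, start = [], 0
    for i in range(m):
        # The first `extra` bins take one additional account each.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins
```

Because the sort key depends only on the pubkey and the seed, every validator that agrees on the set of stake accounts and the seed derives the same partitions, regardless of the order in which it iterates the accounts.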
M can be chosen based on 50K accounts per block, which equals
num_stake_account / 50,000, rounded up. num_stake_account is extracted from the
leader_schedule_epoch block, so that we don't run into a discrepancy where new
transactions right before an epoch boundary create one fork with X stake
accounts and another fork with Y stake accounts.
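As a worked example of the 50K accounts per block sizing, the sketch below computes M by ceiling division. Rounding up is an assumption made here so that a partial final partition still gets a block; the function name and the `accounts_per_block` parameter are hypothetical.

```python
def num_credit_blocks(num_stake_accounts, accounts_per_block=50_000):
    """Number of credit blocks M, using integer ceiling division."""
    return -(-num_stake_accounts // accounts_per_block)


# With the 550K stake accounts mentioned earlier, M comes out to 11.
print(num_credit_blocks(550_000))  # 11
```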
To avoid putting the extra burden of computing and crediting the stake rewards
on blocks produced during the rewarding interval, we can reduce the compute
budget limits on those blocks and reserve some computing and read/write
capacity to perform stake rewarding.
- stake accounts reads/writes during the rewarding interval
(epoch_start..epoch_start+N+M)
Because of the delayed credit of the rewards, reads of those stake accounts
will not return the value that the user expects (viz. they will not include
the most recent epoch's stake rewards). Writes to those stake accounts will be
lost once the rewards are credited on block epoch_start+N+M. We will need to
modify the runtime to restrict reads/writes to stake accounts during the
rewarding interval. Any transaction that involves stake accounts will result
in a new execution error, i.e. "stake rewards pending, account access is
restricted". However, normal rpc queries, such as 'getBalance', will return
the current lamports of the account. The user can expect the rewards to be
credited at some point during the 'rewarding interval'.
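The runtime restriction on stake accounts can be sketched as a check made before a transaction executes. This is only an illustration: the exception class, the function name, and the way stake accounts are identified (a set of pubkeys) are assumptions, and only the error message comes from the text above.

```python
class StakeRewardsPending(Exception):
    """Hypothetical execution error for transactions that touch stake
    accounts while rewards are pending."""


def check_stake_access(account_keys, stake_account_keys, block_height,
                       epoch_start, n, m):
    """Raise StakeRewardsPending if a transaction touches a stake account
    during the rewarding interval [epoch_start, epoch_start + n + m]."""
    in_interval = epoch_start <= block_height <= epoch_start + n + m
    if in_interval and any(k in stake_account_keys for k in account_keys):
        raise StakeRewardsPending(
            "stake rewards pending, account access is restricted"
        )
```

Transactions that do not reference a stake account, and any transaction outside the rewarding interval, pass the check unchanged.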
- voting during the rewarding interval
During the rewarding interval, vote transactions must be processed normally to achieve consensus and make progress on rooted blocks. However, those vote transactions may change the vote account's balance before the epoch rewards are paid (i.e. to pay the voting transaction fee, if the vote_account and the block reward recipient account are the same). When the epoch rewards are paid, those block rewards will be wiped out by the stale cached value. To prevent this, we will enforce that the vote_account and the authorized_voter authority must be different.
- snapshot taken during the rewarding interval
If a snapshot is taken during the rewarding interval, it would miss the
rewards for the stake accounts. Any plain restart from such a snapshot will be
wrong unless we reconstruct the rewards from the recent epoch boundary. This
will add some complexity to validator restart. In the first implementation, we
will not take any snapshots or perform accounts hash calculation during the
rewarding interval. Incremental snapshot requests will be skipped. Full
snapshot requests will be re-queued and picked up later, at the end of the
rewarding interval.
In the future, if needed, we can revisit this to enable taking snapshots and performing hash calculation during the rewarding interval.
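The first-implementation snapshot policy can be sketched as a small scheduler. The class, its method names, and the returned status strings are hypothetical; only the policy itself (skip incremental requests, re-queue full requests until the rewarding interval ends) comes from the text above.

```python
from collections import deque


class SnapshotScheduler:
    """Hypothetical handler for snapshot requests around the rewarding
    interval."""

    def __init__(self):
        # Slots of full snapshot requests deferred until the interval ends.
        self.pending_full = deque()

    def on_request(self, kind, slot, in_rewarding_interval):
        """Return 'taken', 'skipped', or 'requeued' for a snapshot request."""
        if kind not in ("full", "incremental"):
            raise ValueError(f"unknown snapshot kind: {kind}")
        if not in_rewarding_interval:
            return "taken"
        if kind == "incremental":
            return "skipped"
        self.pending_full.append(slot)
        return "requeued"

    def on_interval_end(self):
        """Hand back the deferred full snapshot requests, oldest first."""
        drained = list(self.pending_full)
        self.pending_full.clear()
        return drained
```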
- account-db related actions during the rewarding interval
Account-db related actions such as flush, clean, squash, shrink, etc. may touch
and evict the stake accounts from the account db's cache during the rewarding
interval. This will slow down the reward credit later, when the bank writes
back those accounts. We may need to exclude such accounts_db actions for
stake_accounts during the rewarding interval. This is going to be a
performance tuning problem. In the first implementation, for simplicity, we
will keep the account-db actions as they are, and make the credit interval
larger to accommodate the performance hit when writing back those accounts. In
the future, we can continue tuning account db actions during the 'rewarding
interval'.
- view of total epoch capitalization change
The view of total epoch capitalization, instead of being available at every
epoch boundary, is only available after the rewarding interval. Any third
party application logic that depends on total epoch capitalization needs to
wait until the rewarding interval has ended.
getInflationReward JSONRPC API method call
Currently, the getInflationReward JSONRPC API method call can simply grab the
first block in the target epoch and look up the target stake account's rewards
entry. With these changes, the call will need to be updated to derive the
target stake account's credit block, grab that block, and then look up the
rewards. Additionally, we'll need to return more informative errors for queries
made during the lockout period, so that users know their rewards are pending
for the target epoch. A new rpc API, i.e. getRewardInterval, will be added for
querying the rewarding interval for the current epoch.
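Deriving a stake account's credit block could look like the sketch below. It assumes that partition i (counting from 0) is credited in the i-th block of the credit interval [epoch_start+N+1, epoch_start+N+M]; the proposal does not state this mapping, and the function name is hypothetical.

```python
def credit_block_height(partition_index, epoch_start, n, m):
    """Block height at which a given partition is credited, assuming
    partition i is paid in the i-th block of the credit interval."""
    if not 0 <= partition_index < m:
        raise ValueError("partition_index must be in [0, m)")
    return epoch_start + n + 1 + partition_index
```

Under that assumption, an RPC handler would first find the account's partition for the target epoch, then fetch the block at the returned height and read the account's rewards entry from it.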