Jarrod Johnson
4c8ba92856
Change configuration sync to use msgpack
...
This removes use of pickle for config sync over network.
2020-01-27 15:53:29 -05:00
Jarrod Johnson
97ca6dc48e
Provide more detail on leader when leader is lost
2019-10-21 13:55:43 -04:00
Jarrod Johnson
a84b88e269
Fix mistake in the expression change
2019-10-14 15:02:45 -04:00
Jarrod Johnson
fc626d36ba
Fix greenlet 'isAlive'
...
There is no 'isAlive' in a greenlet.
2019-10-14 13:59:24 -04:00
Jarrod Johnson
8cab591a8b
Add collective member deletion
...
This allows deletion of a dead member, all the way down to
non-collective mode.
2019-10-10 11:30:03 -04:00
Jarrod Johnson
c1953bdad3
Another set of Python 3 compatibility fixes
...
Numerous issues arose, particularly
when participating in a mixed
collective.
2019-10-08 10:45:43 -04:00
Jarrod Johnson
8fc3b7c9c0
Implement cross-python collective compat
...
This enables cross-version compatibility
for a collective.
2019-10-07 15:41:38 -04:00
Jarrod Johnson
3105b9b1f9
Significantly rework the collective startup behavior
...
Make the tracking bools enforce a lock to reduce confusion.
Treat an initializing peer as failed, to avoid getting too fixated
on an uncertain target.
Make sure that no more than one follower is tried at a time by
killing before starting a new one, and syncing up the configmanager
state.
Decline to act on an assimilation request if we are trying to connect,
and also if the current leader asks us to connect and we already are.
Avoid calling get_leader while connecting, as that can cause a member
to decide to become a leader while trying to connect, by swapping
the reactions to the connect request.
Avoid trying to assimilate existing followers.
Fix some logging.
2018-10-12 11:45:23 -04:00
Jarrod Johnson
f525c25ba6
Provide more verbose collective logging
...
This helps understand the flow in practice of collective behavior.
2018-10-11 15:15:11 -04:00
Jarrod Johnson
be930fc076
Add missing subsystem marker from a collective log
2018-10-10 16:30:28 -04:00
Jarrod Johnson
32ddb33de3
Fix error when trying to do fullsync without globals yet
...
If globals is missing, then do not break the sync trying to handle it
2018-10-10 13:11:15 -04:00
Jarrod Johnson
b77ed8dbff
Fix config sync on dead writer
...
The sync thread can die without clearing syncrunning. Make sure that
the thread is alive *and* that the thread has not indicated
intent to give up.
2018-10-10 13:07:27 -04:00
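
The check described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: `SyncWorker`, `giving_up`, and `is_usable` are hypothetical names, and the standard `threading` module stands in for whatever threading mechanism the project actually uses.

```python
import threading


class SyncWorker:
    """Hypothetical wrapper around a background sync thread."""

    def __init__(self):
        self.thread = None
        # Set by the worker when it intends to stop taking new work.
        self.giving_up = False

    def start(self, target):
        self.giving_up = False
        self.thread = threading.Thread(target=target, daemon=True)
        self.thread.start()

    def is_usable(self):
        # A "running" flag alone is not trustworthy, because the thread
        # can die without clearing it. Require both that the thread is
        # alive and that it has not signalled intent to give up.
        return (self.thread is not None
                and self.thread.is_alive()
                and not self.giving_up)
```

A caller would start a fresh worker whenever `is_usable()` returns false, rather than trusting a stale flag.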
Jarrod Johnson
d86e1fc4eb
Give the cfg init a lock
...
Move collective manager and configmanager to share a configinitlock,
so that bad timings during internal initialization and collective
activity cannot interfere and produce a corrupt database.
This became an issue with the fix for 'everything' disappearing.
2018-10-02 10:17:44 -04:00
Jarrod Johnson
78a1741e0e
Fix usage of check_quorum()
...
It is not a boolean, it is exception driven.
2018-10-01 16:02:16 -04:00
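
The distinction in the commit above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the signature of `check_quorum`, the `QuorumError` name, and the simple-majority rule are hypothetical and not taken from the project.

```python
class QuorumError(Exception):
    """Hypothetical exception raised when quorum is not held."""


def check_quorum(members_online, members_total):
    # Exception driven: returns None on success, raises on failure.
    # Assumed rule for this sketch: a strict majority must be online.
    if members_online * 2 <= members_total:
        raise QuorumError('quorum not held')


def has_quorum(members_online, members_total):
    # Writing `if check_quorum(...)` would always take the false branch,
    # because the function returns None even when quorum is held.
    try:
        check_quorum(members_online, members_total)
    except QuorumError:
        return False
    return True
```

Callers that need a boolean wrap the call in `try`/`except`, as `has_quorum` does here.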
Jarrod Johnson
4329c1d388
Have collective start bail out if leader
...
A leader should not relinquish while it has quorum, so do not bother
in such a case.
2018-10-01 15:50:49 -04:00
Jarrod Johnson
b0b5493ff7
Cancel retry if we become leader
...
If an instance is first to start, its retry should be canceled
when other members prod it to become leader.
2018-10-01 15:29:18 -04:00
Jarrod Johnson
61e7c90ad1
Do not restart on intentional kill
...
Additionally, add some output to help filter events log
2018-10-01 10:32:55 -04:00
Jarrod Johnson
e57cdf9a7b
Add more collective event log handling
...
More detail to analyze how the collective membership is handled.
2018-09-27 15:15:05 -04:00
Jarrod Johnson
10ce7a9de9
Add more logging to collective process
2018-09-27 10:51:06 -04:00
Jarrod Johnson
0724ad812b
Add logging to the assimilation phase of collective
...
When attempting assimilation, provide logging about the attempt.
2018-09-27 10:51:01 -04:00
Jarrod Johnson
a3b0b0240d
Abort assimilation attempt on non-member cleanly
...
If a confluent instance has forgotten the collective, more cleanly
handle the situation, and abort the assimilation rather than assuming
the peer should be leader, unless txcount specifically is called out
as the reason.
2018-09-27 10:50:54 -04:00
Jarrod Johnson
784e4bed2f
Force cleanup if follow thread dies of exception
...
If something killed a follow thread, it was not always able to fire the
recovery off. Wrap the risky code in a try statement.
2018-08-20 15:02:34 -04:00
Jarrod Johnson
f0edbbad39
Have collective show present some info when not in quorum
2018-07-20 14:11:38 -04:00
Jarrod Johnson
5cf1671350
Make the takeover process more deterministic
...
Try to avoid submitting to be a follower while we are currently
becoming a leader
2018-07-20 13:50:42 -04:00
Jarrod Johnson
e5c4219ee9
Reorder certificate check
...
The first order of business is to verify the certificate before even
thinking about whether the request is possible.
2018-07-20 13:34:14 -04:00
Jarrod Johnson
a1ba5f59a8
Fix collective show on non-collective
2018-07-19 17:21:01 -04:00
Jarrod Johnson
9bcca6bfad
Provide collective show on all members
2018-07-19 17:08:20 -04:00
Jarrod Johnson
54d93571d1
Have leader provide more data in collective show
2018-07-19 16:26:05 -04:00
Jarrod Johnson
f2f902de7b
Have collective show report when collective inactive
...
Collective show was misleading if not in a collective.
2018-07-19 15:59:15 -04:00
Jarrod Johnson
a09792f969
Schedule periodic attempts to restart collective
...
If collective is lost due to connectivity, this will cause
occasional attempts to bring it back.
2018-07-19 15:49:05 -04:00
Jarrod Johnson
7d16c943a8
Handle updating address of collective member on connect
...
If a collective member changes its IP address, update at the next
possible opportunity.
2018-07-19 15:24:08 -04:00
Jarrod Johnson
497ca40492
Do not abort connecting process on bad cert
...
The target may be non-viable, but don't let that ruin the party
for everyone. Let it keep going as if the system were down.
2018-07-18 14:58:16 -04:00
Jarrod Johnson
fc5472065a
Catch missing '@' in token as invalid token
2018-07-17 11:46:40 -04:00
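
A minimal sketch of this kind of validation, assuming a token shaped like two parts joined by '@'. The part names, the `InvalidToken` exception, and the `split_token` helper are hypothetical; only the rule that a missing '@' makes the token invalid comes from the commit.

```python
class InvalidToken(Exception):
    """Hypothetical error for a malformed token."""


def split_token(token):
    # partition() never raises; an empty separator means '@' was absent.
    name, sep, secret = token.partition('@')
    if not sep or not name or not secret:
        raise InvalidToken('token must look like <name>@<secret>')
    return name, secret
```

Reporting a dedicated error lets the caller answer with "invalid token" instead of failing on an unpacking error.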
Jarrod Johnson
1dad69097b
Be consistent with sync during load of leader cfg
...
Pass through sync as appropriate.
Also includes changes meant for the previous commit.
2018-07-13 21:52:17 -04:00
Jarrod Johnson
fd7c428d1f
Cleanup leftover sockets and more reliably be following or leading
...
Previously, there was a chance to be in a half state, leading to an
inability to reach consensus on leader.
2018-07-13 21:20:42 -04:00
Jarrod Johnson
c74fdf5924
More collective join errors
2018-07-13 11:07:39 -04:00
Jarrod Johnson
58bf226d23
Relay error from server about token issue
2018-07-13 10:50:17 -04:00
Jarrod Johnson
c80ebb0e8d
Explicitly close connection before replacement
...
If an existing follower is stalled out, close the socket explicitly
to avoid leaving it open in lsof.
2018-07-13 09:14:36 -04:00
Jarrod Johnson
efaf1dae70
Make cfgleader modifications more robust
...
If cfgleader is about to forget a socket, explicitly try to close
it first.
2018-07-13 09:05:28 -04:00
Jarrod Johnson
7cdc3c1400
Implement clear config rollback
...
Should something go awry during config
load, rollback the clear and load.
2018-07-12 08:48:21 -04:00
Jarrod Johnson
beedfb0600
If a drone doesn't exist, treat it as if it's an invalid certificate
2018-07-11 16:29:45 -04:00
Jarrod Johnson
ce59a36351
Avoid excessive syncs on connect
...
This removes some redundancy and avoids writing to and loading from disk
during the initialization process.
2018-07-11 16:07:56 -04:00
Jarrod Johnson
8e9bcbb44f
Clear txcount on enroll
...
The transaction count on 'join' was being honored as high, when
it never should be.
2018-07-11 09:40:22 -04:00
Jarrod Johnson
704aaeecf9
Tolerate newline in myname
...
vim is quite insistent on adding a newline, tolerate that.
2018-07-11 09:36:51 -04:00
Jarrod Johnson
11968faffc
Numerous fixes to collective
...
If client has higher transaction count, do not close the connection
before extracting peer address.
If our connect session is rudely terminated, abort rather than trying
to continue.
On assimilate failure, ignore a failed assimilate with no data.
Fix problem where a follower getting double deleted was causing an error.
2018-07-10 14:55:57 -04:00
Jarrod Johnson
298e11f60f
Allow invite from non-leader role
...
A non-leader transaction is modified such that the enroll node
can be connected to the leader and have validation.
2018-07-09 16:40:43 -04:00
Jarrod Johnson
67d6e9a6c7
Add collective show
...
Provide a harmless way to look at collective state
2018-07-09 15:07:24 -04:00
Jarrod Johnson
2342fe717e
Remove superfluous call to sync to file
...
load_from_json already makes the call, remove the extra call that is
redundant.
2018-07-09 12:59:37 -04:00
Jarrod Johnson
1eaf5357ca
Resolve race conditions on simultaneous collective outage
...
Implement random backoff strategy for serializing connect out and
connect in.
2018-07-03 14:09:09 -04:00
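
The general idea of a random backoff can be sketched as below. This is not the project's implementation: the `random_backoff` helper, its `max_delay` parameter, and the five-second window are assumptions chosen for illustration.

```python
import random


def random_backoff(max_delay=5.0, rng=random):
    """Return a random delay in seconds within [0, max_delay].

    When several members lose the collective at the same moment, each
    one waits a different random time before connecting out, so that
    one member is likely to act first and the others can react to it
    instead of all of them colliding.
    """
    return rng.uniform(0, max_delay)
```

A member would sleep for `random_backoff()` seconds before attempting its outbound connection, and skip the attempt if an inbound connection arrived in the meantime.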
Jarrod Johnson
956faee052
Correct typo in variable name
2018-06-28 14:12:56 -04:00