Names - X Datastructure
- Agoric (from agora, a shared space for announcements or markets)
- Collaborative (facilitates collaboration between participants)
- Collective (advantageous)
- Decentralised (no central truth, master or authority)
- Democratic (every individual is allowed to input and converges towards consensus)
- Distributed
- Integrating (integrates individual views in interesting clusters)
- Joined (integrated, social, collaborative, join-semilattice)
- Scale-free (operates on any scale, small scale and web scale. i.e. compatible with topics whose popularity follows a power law)
- Shared (perhaps pretty much spot on)
- Small-world
- Social (sharing is a social activity; sharing happens in a decentralized way, between socially/loosely connected agents)
- Subjective (datastructure fragments are tagged with subjective mental states)
- Stigmergic (shared environment allows stigmergic developments)
What we intend to solve
- Scaling up social interaction.
- Deal with geospatially distributed shared environments
- Respect individual differences
- Get the most out of our individual differences
Algebraic properties of incremental datastructures
Incremental datastructures are datastructures A that have a delta type B, together with a merge function merge: (A, B) => A and delta-generating operations Gi..Gj: A => B, s.t. for any a: A and i in N:
merge(a, Gi(a)) = a', where a' denotes the result of applying Gi to a. Note that Gi(a) does not denote application but delta derivation; merge denotes the application of deltas.
merge can have several properties:
- associative
- commutative
- idempotent
The algebraic properties of functions compose and form a meet-semilattice: for any functions f and g, the algebraic properties of their composition f.g are:
props(f.g) = props(f) intersect props(g)
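To make the definition concrete, here is a minimal Scala sketch (our own naming, not part of the original design) of the interface above, instantiated with a grow-only set whose merge (set union) is associative, commutative and idempotent:

```scala
// Sketch of the incremental-datastructure interface described above: a state A,
// a delta type B, a merge that applies deltas, and operations G_i that *derive*
// deltas rather than applying them directly.
trait Incremental[A, B] {
  def merge(a: A, delta: B): A        // apply a delta to the state
  def add(a: A, elem: Int): B         // example operation G_i: A => B (hypothetical)
}

// Example instance: a grow-only set of Ints. Its merge (set union) is
// associative, commutative and idempotent.
object GSet extends Incremental[Set[Int], Set[Int]] {
  def merge(a: Set[Int], delta: Set[Int]): Set[Int] = a union delta
  def add(a: Set[Int], elem: Int): Set[Int] = Set(elem)
}

// merge(a, G_i(a)) = a': applying the derived delta yields the updated state.
val a  = Set(1, 2)
val a2 = GSet.merge(a, GSet.add(a, 3))   // Set(1, 2, 3)
```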
Supported datastructures
- Primitives: Boolean, Integer, Float, String, Date
- Monotonic primitives: Asc/Desc versions of Boolean, Integer, Float and Date (see the sketch after this list)
- ADT: Product, Coproduct, enum
- MVRegister
- Special: Interval + IntervalTrie + IntervalMap
- Collections: Set, Bag/MultiSet, Ranked Bag, List, Map, Matrix, Spreadsheet, Monotonic DAG, Graph
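As an illustration of the monotonic primitives listed above, a minimal sketch (our own naming) of an ascending Integer, whose merge takes the maximum and is therefore associative, commutative and idempotent:

```scala
// Hypothetical ascending Integer: the value can only grow, merge is max.
case class AscInt(value: Int) {
  def merge(other: AscInt): AscInt = AscInt(math.max(value, other.value))
}

// Merge order and repetition do not matter:
// AscInt(3).merge(AscInt(5)) == AscInt(5).merge(AscInt(3)) == AscInt(5)
```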
Higher datastructures
- Mindmap
- Argumentation tree
- Group budgeting (combining rank and quantitative aggregation)
- Source code / git repo
- Rank aggregation / collective list ordering
- Filesystem (immutable blob support)
Monotonicity
Eventual consistency is guaranteed for monotonic datastructures whose updates eventually reach all participants. However, there are many possible implementations, so we should set principles that allow us to narrow this set:
Agent-oriented
We state that it is important that each agent can reflect its own mental state, potentially building on top of a predefined collective state, but ultimately redefining it as far as the agent requires, without being influenced or disturbed by other agents. This corresponds to forking in git. However, an agent should also be able to experiment and therefore start multiple potential versions from a given base version. This corresponds to branching in git. These two concepts are distinguished only by an unimportant implementation detail, so we treat them as the same. Let's use the term branching in agent-oriented datastructures for the starting of an alternative reality, given a base reality. Git takes a pragmatic approach to integrating these branches: they are separated into forks, each with some authoritative party that decides which branches are ultimately merged in. We take a different approach:
First, each agent A decides for itself a (join-semi)lattice of qualifications La. La consists of a set of labels L, a partial order P over L, and a merge function m, s.t. for any two l1 and l2 in L, l3 = m(l1, l2) implies l1 <=P l3 and l2 <=P l3.
Second, any agent A with qualifications La may include in its definition the qualifications Lb of another agent B, as long as the ordering is respected. Alternatively, one could see this as one big (join-semi)lattice in which both nodes and (partial-order) edges each have a single owner.
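A minimal sketch (our own encoding; the names are not from the original text) of a per-agent qualification lattice and the law it must satisfy:

```scala
// A per-agent qualification (join-semi)lattice: labels, a partial order P,
// and a merge m whose result is an upper bound of both inputs.
case class Label(name: String)

case class QualificationLattice(
    labels: Set[Label],
    leq: (Label, Label) => Boolean,   // the partial order P
    merge: (Label, Label) => Label    // m: yields an upper bound under P
) {
  // The law from the text: l1 <=P m(l1, l2) and l2 <=P m(l1, l2).
  def lawHolds(l1: Label, l2: Label): Boolean = {
    val l3 = merge(l1, l2)
    leq(l1, l3) && leq(l2, l3)
  }
}

// Example: a tiny totally ordered quality lattice Draft <= Reviewed <= Accepted.
val draft = Label("Draft"); val reviewed = Label("Reviewed"); val accepted = Label("Accepted")
val rank  = Map(draft -> 0, reviewed -> 1, accepted -> 2)
val quality = QualificationLattice(
  labels = Set(draft, reviewed, accepted),
  leq    = (a, b) => rank(a) <= rank(b),
  merge  = (a, b) => if (rank(a) >= rank(b)) a else b
)
assert(quality.lawHolds(draft, reviewed))
```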
Requirements
We build some context for our requirements by showing how we intend to use our datastructures. We'll proceed by listing and explaining the requirements.
Collaboration domains
Here we list the requirements that we have for collaborative datastructures in various domains of collaboration:
Software engineering
- Mark a prefix of history with a certain label that is interesting from the perspective of your own work, such as (main/test) compiling, passing a (set of) test(s), or acceptable; mark history carrying another agent's label as interesting.
Valuation := Set[(Agent, JudgmentLattice)]
JudgmentLattice := (forall a: Agent) LeastUpperBound(a.lattice)
Judgment := ManualSubjectiveJudgment | AutomatedObjectiveJudgment
Valuation := LexProductList[Map[Quality, Map[Set[Agent], Set[Error]]]]
Quality := {Compilation(Class), Test(Name)}
Assumption - Quality improves over time:
- Adding a quality (even if it is a non-compiling class or a failing test) is in itself better than not having the quality. However, qualities can be ranked too: compile-failing class < compiling class and failing test < succeeding test.
- Removing a quality (class, test) is better than having a quality (compiling class, succeeding test)
- LexProductList accounts for the fact that adding a non-compiling class implies not being able to run the tests. With respect to continuous integration, (almost all?) changes are an improvement, but with respect to testability or releasability, only an increasingly small subset is. The simplest thing that models this is a linear pipeline (see the sketch below).
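A minimal Scala sketch (our own names, under the assumptions above) of how a LexProductList-style valuation could order states: the pipeline is compared stage by stage, so compilation errors dominate test results:

```scala
// Qualities and per-stage error reports, following the Quality definition above.
sealed trait Quality
case class Compilation(clazz: String) extends Quality
case class Test(name: String)         extends Quality

// A pipeline stage maps each quality to the set of errors reported for it (empty = passing).
type Stage = Map[Quality, Set[String]]

def errorCount(s: Stage): Int = s.values.map(_.size).sum

// Lexicographic order over the pipeline: the earliest stage that differs decides.
// A negative result means `a` is the better state (fewer errors earlier in the pipeline).
def lexCompare(a: List[Stage], b: List[Stage]): Int =
  a.zip(b)
    .map { case (sa, sb) => errorCount(sa).compare(errorCount(sb)) }
    .find(_ != 0)
    .getOrElse(0)
```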
Complications:
- Objective pipelines; ideally we want to run them once and be done with it. However, we may not trust the results that other agents claim to have obtained when running theirs.
- Pipeline differences; Some agents may incorporate additional quality metrics, such as linting or test-coverage.
- Subjectivity; Linting configurations and acceptability are subjective. Linting configurations may not cleanly merge either (due to opposing rules being acceptable to some). Whether a particular test-coverage (delta) is acceptable depends on the agent, and possibly even the context.
- Integrate vs. Fork: Sometimes integration of developments is not the best and forking really is the better option. Ideally, forks can benefit from beneficial developments that are compatible with both sides without incorporating changes that are only compatible with the other. Granularity is key here, but isn't always easy.
- Prioritization: Certain new features are nice to have, whereas security fixes really ought to receive priority when choosing what to integrate.
Document/Wiki editing
- Allow anybody to edit
- Label edits based on author maturity/stake: derive author maturity from edit age, i.e. the sum of character-add/remove ages (see the sketch after this list).
- Accept edits of a certain quality label
- Review edits of a certain quality label, and add to their evaluation. Alternatively, this can be seen as an author explicitly merging a change
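A hypothetical sketch (our own field names) of deriving author maturity from edit age, as suggested above: sum the age of each of the author's surviving character additions/removals.

```scala
// One surviving character-level addition or removal, with its author and time.
case class CharEdit(author: String, timestampMillis: Long)

// Maturity = sum of the ages of the author's character edits.
def maturity(author: String, edits: Seq[CharEdit], nowMillis: Long): Long =
  edits.filter(_.author == author).map(e => nowMillis - e.timestampMillis).sum
```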
CRUD
- model changes
- entry changes
Principles
- Participants develop their own partial order of quality.
- A partial order of quality may consist of checks that are manual or automated and subjective or objective.
Search space prioritization
We can limit the vast search space of input combinations that potentially yield increased value through several means:
Input qualities
Merging two updates with certain qualities leads to a state with another set of qualities. Unfortunately, one cannot say in general what qualities merging arbitrarily qualified inputs will result in. Most likely one obtains something in between the union and intersection of the input qualities. The extremities seem less likely and therefore can serve as heuristics for finding potential qualitative merge-inputs:
- Merging two updates is unlikely to produce qualities that the inputs didn't have.
- The larger the number of qualities in common between two inputs, the more likely their merge is to retain some of them
- Qualities in common are more likely to be retained than unique qualities.
- Merging two inputs that differ only by 'small' amounts is more likely to retain qualities than merging inputs that have 'large' differences.
Popularity
It is good practice to focus on producing value in short iterations; the increased feedback is highly useful in improving quality. Given that the majority of people will be exposed to one of the Pareto frontiers, testing your newly devised mutations against those should receive priority over testing them against inputs arbitrarily far from those frontiers.
Modeling subjective valuations
Agents may use their own partial order of valuation. In order to do so, an agent specifies the type of its partial order. Partial orders can be built up from the following building blocks (a sketch follows the list):
- Ordered primitives: boolean, int, float | descending or ascending
- Products: combined partial orders of equal weight
- Lexicographic product: combining partial orders in order of importance
- Coproducts: ordered alternatives
- Set[A]: Descending or Ascending in cardinality
- Map[K, V]: Descending or Ascending in cardinality
- Each agent's valuation may define its own structure
- Each agent may update its valuation structure - This is easy. Make sure valuations are indexed by a hash over the [User]:[Schema]. This way it is essentially the same as dealing with a new user.
- Users may leave the valuation: Do nothing; typically the stable root will form a new frontier, and a new main root will emerge whose frontier no longer includes the valuation of the user that left.
- Users may join the valuation - How we deal with this depends on our implementation of users joining at all
- Is there consensus before a participant can participate? That would simplify a number of things, but may invite authority over the datastructures...
- If users can join concurrently, they may also propose valuations over old data that is already compacted.
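A minimal sketch (our own encoding) of the partial-order building blocks above, so that each agent can declare its valuation type by composing them:

```scala
// A partial order on A: leq(x, y) means x <= y (possibly neither direction holds).
trait POrd[A] { def leq(x: A, y: A): Boolean }

object POrd {
  // Ordered primitives, ascending or descending.
  def asc[A](implicit o: Ordering[A]): POrd[A]  = (x, y) => o.lteq(x, y)
  def desc[A](implicit o: Ordering[A]): POrd[A] = (x, y) => o.gteq(x, y)

  // Product: both components must agree (equal weight).
  def product[A, B](pa: POrd[A], pb: POrd[B]): POrd[(A, B)] =
    (x, y) => pa.leq(x._1, y._1) && pb.leq(x._2, y._2)

  // Lexicographic product: the first order decides; only on a tie does the second matter.
  def lex[A, B](pa: POrd[A], pb: POrd[B]): POrd[(A, B)] =
    (x, y) => {
      val strictlyLess = pa.leq(x._1, y._1) && !pa.leq(y._1, x._1)
      val equal        = pa.leq(x._1, y._1) &&  pa.leq(y._1, x._1)
      strictlyLess || (equal && pb.leq(x._2, y._2))
    }

  // Cardinality orders for collections.
  def setAsc[A]: POrd[Set[A]]       = (x, y) => x.size <= y.size
  def mapAsc[K, V]: POrd[Map[K, V]] = (x, y) => x.size <= y.size
}

// Example: order by number of passed checks (ascending) first, then by error
// count (descending, i.e. fewer errors ranks higher).
val valuationOrder: POrd[(Set[String], Int)] =
  POrd.lex(POrd.setAsc[String], POrd.desc[Int])
```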
Valuations vs. Argumentation
Valuations allow some of the features that we also aim to place in deliberative structures. Two valuations can be opposite to one another, and the result may be that edits lead to alternating values that satisfy one or the other. A deliberative datastructure could be used to settle the value between participants.
Compactions and availability
Can we combine compactions with availability?
Availability: Always be able to operate on the local data-structure and be confident that your produced deltas will merge into whatever state exists elsewhere.
Compaction: Once in a while, we would like to clean up data that doesn't seem to serve much purpose. Some metadata is only there for extreme availability; a majority of nodes may have developed common knowledge about a certain prefix of history, which seems like a good moment to compact that history into the commonly perceived state. However, compaction complicates the scenario in which a node holding a strict prefix of that history plus some concurrent edits wants to merge those edits back into the common state:
The proposed datastructure could look as follows (a sketch follows the list):
- Insert-only DAG of compacted states: These states are inserted when a majority of nodes develops common knowledge of the state of the raw DAG. They extract state with monotonic metadata overhead from the main state and build a compact representation in a deterministic way.
- Insert only DAG of raw states (S, Valuations)
- Gossip graph: in line with PARSEC, this is used to establish common knowledge; knowing of one another what we know, in a majority of nodes. Once a majority has seen that the majority has seen a particular edit, we know it can be safely compacted.
- Valuations
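A minimal sketch (our own naming; the actual gossip mechanics would follow a protocol such as PARSEC) of the proposed replica state and the "majority has seen that a majority has seen" compaction condition:

```scala
case class NodeId(value: String)
case class DagNode[A](id: String, parents: Set[String], payload: A)

case class ReplicaState[S, V](
    compacted: Map[String, DagNode[S]],             // insert-only DAG of compacted states
    raw: Map[String, DagNode[(S, V)]],              // insert-only DAG of raw states + valuations
    gossip: Map[NodeId, Map[String, Set[NodeId]]]   // per peer: which peers it knows have seen each raw node
) {
  // A raw node is safe to compact once a majority of peers is known to have
  // seen that a majority of peers has seen it (common-knowledge approximation).
  def safeToCompact(rawId: String, peers: Set[NodeId]): Boolean = {
    def majority(s: Set[NodeId]): Boolean = s.size > peers.size / 2
    val peersKnowingMajoritySaw = peers.filter { p =>
      majority(gossip.getOrElse(p, Map.empty).getOrElse(rawId, Set.empty))
    }
    majority(peersKnowingMajoritySaw)
  }
}
```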
Edits are structured as follows (see the sketch after the list):
- Causal context: What edits preceded this one, either remotely or locally?
- Dot: A pair of replica-id and edit-identifier that is unique w.r.t. replica-id (usually drawn from a monotonic sequence; either one for the replica, or one per replicated datastructure)
- Delta: The change to the data-structure
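A minimal sketch (field names are ours) of an edit with the three parts listed above:

```scala
// A dot uniquely identifies an edit: replica id plus a counter from that replica's monotonic sequence.
case class Dot(replicaId: String, counter: Long)

case class Edit[Delta](
    causalContext: Set[Dot],   // dots of the edits that preceded this one, locally or remotely
    dot: Dot,                  // identity of this edit
    delta: Delta               // the change to the datastructure
)
```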
Compaction can merge edits and will retain causal consistency if the set of edits together forms a causally consistent update with respect to some causal context. Ideally we only retain the delta and drop the causal context (which is made implicit by placement in the compacted DAG: it is the set of parents of this node). The trouble is that once you strip the original causal context, nodes that operated on a particular prefix of the merged delta now hold references that fail to resolve on nodes with the compacted version, i.e. they cannot establish:
- whether the edit has already been applied
- if it has been applied, which compacted node contains it
- if it has not been applied, which compacted node it should be added to (although, if we include the recent-compact-node identifier as part of the causal context, we gain some information, and the newest compacted-node should be fine to add it to, but obviously not if it was applied already)
Conclusion: we can do compaction by merging deltas, but we cannot purge dot-vectors from causal contexts if we want to retain availability. We may choose to give up availability after some timeframe, after which we purge dot-vectors from the causal contexts.
Can we purge dots from dot-sets? I would think that just retaining the causal context is enough, because we're mostly interested in knowing whether or not an old edit was already applied. Dot-sets inside the datastructure may be rewritten as events drawn deterministically from a sequence for that compaction node. This might work for adding elements to a set (although it would create a concurrent dot, which isn't right), but it would certainly fail for removing elements (then again, the current proposal would lead us to insert-only structures that process removals on compaction, so not removals in the delta-CRDT sense).
Designs
Valuations are overkill for EWFlag, but useful for:
- MVRegister: Arbitrary checks on arbitrary inner value(s)
- AWSet: Arbitrary checks on arbitrary inner values
- ORMap:
Partial replication
Part of a planetary scale datastore is supporting partial replication, which constrains our causality tracking mechanism.
There are version vector designs that have some tricks for the compact representation of causal context. Events come as a pair of replica-id (globally unique) and event-id (unique for that replica). Rather than storing a set of event identifiers, one can compact a contiguous set of dots into an interval, and a contiguous set of dots starting from 0 into just the maximum number of the interval.
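A minimal sketch (our own naming) of such a compact causal context: per replica, the gapless prefix from 0 is stored as a single maximum, and dots beyond a gap are kept as explicit exceptions.

```scala
case class CompactContext(
    contiguous: Map[String, Long],    // replicaId -> highest event id of the gapless prefix starting at 0
    exceptions: Set[(String, Long)]   // dots received beyond a gap
) {
  def contains(replicaId: String, eventId: Long): Boolean =
    contiguous.get(replicaId).exists(eventId <= _) || exceptions((replicaId, eventId))

  // Record a dot and fold exceptions back into the contiguous prefix where possible.
  def add(replicaId: String, eventId: Long): CompactContext =
    if (contains(replicaId, eventId)) this
    else copy(exceptions = exceptions + ((replicaId, eventId))).compact(replicaId)

  private def compact(replicaId: String): CompactContext = {
    var max = contiguous.getOrElse(replicaId, -1L)
    var exc = exceptions
    while (exc((replicaId, max + 1))) { max += 1; exc -= ((replicaId, max)) }
    CompactContext(contiguous.updated(replicaId, max), exc)
  }
}
```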
More compaction can be obtained by having a node-specific counter and causal context rather than a datastructure-specific one. The datastructure specific one is still necessary in case the datastructure context is not extrinsic to the node's causal context.
Imagine two nodes n1 and n2 sharing two datastructures D3 and D4, while each also has a third datastructure that is not shared, D1 and D2 respectively:
Challenge 1
Have a compact causal history even if n1 and n2 alternate their writes between D3 and D4. The causal history might look like:
D3 = [(n1, 0), (n1, 2)]
D4 = [(n1, 1), (n1, 3)]
A sensible compaction seems to be:
D3 = [(n1,2)]
D4 = [(n1,3)]
However, the compaction does not allow us to distinguish this from the case where n2 simply didn't receive the earlier events ((n1, 0) and (n1, 1)). We can distinguish that case by marking such exceptions:
D3 = ([(n1, 2)], [(n1, 0-1)])
D4 = ([(n1, 3)], [(n1, 0-1)])
At this point, we don't know which dot belongs to which key, so both are recorded as exceptions and we write an interval to compact them. However, in many practical cases an interval may have more overhead than benefit.
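A hypothetical encoding (our own types) of the exception-marked representation above, with per-datastructure compacted dots plus interval exceptions that cannot yet be attributed to a key:

```scala
case class Interval(from: Long, to: Long)             // inclusive range of event ids
case class DsContext(
    dots: List[(String, Long)],                       // compacted dots known to belong to this datastructure
    exceptions: List[(String, Interval)]               // dots whose datastructure is unknown
)

// The D3/D4 example above:
val d3 = DsContext(List(("n1", 2L)), List(("n1", Interval(0L, 1L))))
val d4 = DsContext(List(("n1", 3L)), List(("n1", Interval(0L, 1L))))
```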
Constraints:
- Optimise for the typical scenario: 1 user per host
- Will support multiple sessions per host, because we expect this software to function as a local CDN
- Will likely never support session guarantees across hosts, simply because this doesn't align with our goals and it runs counter to the primitives that we have chosen to achieve those goals
Design