ETL to QE, Update 37, I was just trying to reinvent event driven development

Date: 2024-05-27

One of the core features of CGFS is an event log providing absolute provenance. Each application is a stream of events, with the application state either stored in those events or ETL'd (Extracted, Transformed, and Loaded) into another data format such as a SQL relational database.
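
As a rough illustration, here is a minimal sketch of what an event record and the ETL step could look like. The field names (`cid`, `prev_cid`, `payload`) and the `insertRow` callback are my assumptions for illustration, not the actual CGFS schema:

```typescript
// Hypothetical shape of a CGFS-style event in an application's stream.
interface CGFSEvent {
  cid: string;        // content identifier of this event
  prev_cid?: string;  // CID of the previous event in the stream
  collection: string; // which application stream this belongs to
  payload: Record<string, unknown>; // application state carried by the event
  created_at: number;
}

// ETL (Extract, Transform, Load): project the event stream into rows
// for a relational store. `insertRow` stands in for any SQL client.
function etlEvents(
  events: CGFSEvent[],
  insertRow: (table: string, row: Record<string, unknown>) => void
): void {
  for (const event of events) {
    insertRow(event.collection, { id: event.cid, ...event.payload });
  }
}
```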

Kafka has a lot of design features in common with CGFS:

  • KSQL is used to transform data from streams so they can be easily queried
    • This is similar to how one can query collections using RxDB, which backs CGFS
  • Kafka Connect is used to store streams in different databases
  • The Kafka Schema Registry forces events on a specific topic to follow a specific format.
    • This is similar to how RxDB, which backs CGFS, requires data to be validated via JSONSchema before being inserted into a collection
  • A Kafka compacted topic does not store all the events in a topic, keeping only the latest event for each key
    • This is similar to how, within CGFS, an upsert checks for an existing entry in the database; if it finds one, that entry's CID is included in the updated entry (see the sketch after this list)
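
Here is a minimal sketch of those last two points in RxDB: JSONSchema validation at the collection boundary, and an upsert that carries the previous entry's CID forward. The schema and field names (`prev_cid`, `value`, `upsertWithProvenance`) are assumptions for illustration, not CGFS's real collections:

```typescript
import { createRxDatabase } from 'rxdb';
import { getRxStorageMemory } from 'rxdb/plugins/storage-memory';
import { wrappedValidateAjvStorage } from 'rxdb/plugins/validate-ajv';

// JSONSchema playing the role of Kafka's Schema Registry: writes that
// do not match the schema are rejected at the collection boundary.
const entrySchema = {
  version: 0,
  primaryKey: 'id',
  type: 'object',
  properties: {
    id: { type: 'string', maxLength: 128 },
    cid: { type: 'string' },       // CID of the latest event for this entry
    prev_cid: { type: 'string' },  // CID of the entry this one replaced
    value: { type: 'string' },
  },
  required: ['id', 'cid', 'value'],
};

async function upsertWithProvenance(id: string, value: string, newCid: string) {
  const db = await createRxDatabase({
    name: 'cgfs-sketch',
    // the validation wrapper enforces the JSONSchema on every write
    storage: wrappedValidateAjvStorage({ storage: getRxStorageMemory() }),
  });
  const { entries } = await db.addCollections({ entries: { schema: entrySchema } });

  // Like a compacted topic, only the latest document per id survives,
  // but we first look up the existing entry so the replacement can
  // carry its CID forward.
  const existing = await entries.findOne(id).exec();
  const doc: Record<string, string> = { id, cid: newCid, value };
  if (existing) doc.prev_cid = existing.get('cid');
  await entries.upsert(doc);
}
```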

There is one key difference between what I am trying to do with CGFS and what Kafka does. Kafka is hyper-scalable: a service is supposed to just dump events into the Kafka cluster to be consumed by whoever wants to use them. CGFS requires a provenance chain, so the next event in a series of events must include the CID of the previous event. Therefore there is no partitioning a topic (event stream) in CGFS.
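
A minimal sketch of that chaining, assuming each event's identifier is derived by hashing its payload together with the previous event's identifier. Real CGFS uses CIDs (multiformats); plain SHA-256 hex is a stand-in here:

```typescript
import { createHash } from 'node:crypto';

interface ChainedEvent {
  cid: string;
  prev_cid: string | null; // null only for the first event in the stream
  payload: string;
}

function appendEvent(log: ChainedEvent[], payload: string): ChainedEvent[] {
  const prev_cid = log.length > 0 ? log[log.length - 1].cid : null;
  // Each event commits to its predecessor, so the log forms one total
  // order: there is no way to split it into independent partitions the
  // way Kafka shards a topic.
  const cid = createHash('sha256')
    .update(`${prev_cid ?? ''}:${payload}`)
    .digest('hex');
  return [...log, { cid, prev_cid, payload }];
}
```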

Kafka may not produce a provenance hash chain of events within a topic, but it does have an incrementing offset that places every event in order. In the last CGFS prototype, mentioned in ETL to QE, Update 34, Failed CGFS Experiments (specifically dentropy/nostr-nip05-server), the one core design mistake that made me want to rewrite the entire application was the fact that there was no ordered provenance chain.
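
Reusing the ChainedEvent shape from the sketch above, checking that a log actually has an ordered provenance chain is a single pass. Kafka's per-partition offsets give ordering; a hash chain gives the same total order plus tamper evidence:

```typescript
// Verify that every event points at the CID of the event before it.
function verifyChain(log: ChainedEvent[]): boolean {
  for (let i = 0; i < log.length; i++) {
    const expectedPrev = i === 0 ? null : log[i - 1].cid;
    if (log[i].prev_cid !== expectedPrev) return false;
  }
  return true;
}
```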

I have been spending a lot of time thinking about the problem of CRDTs (Conflict-free Replicated Data Types) for my CGFS project. A CRDT handles situations like conflicting Git branches, two different versions of a Word document, or a bunch of group-chat messages that failed to send and are sent at a later time, messing up the order of the chat. I guess the solution for this should be to include an optional key-value pair in CGFS Collection - COLLECTIONS called crdt_cid for merging out-of-sync collections. So collections shared peer to peer should have the same collection ID on each participating device but should have some sort of DID or PeerID attached, with a separate CGFS Collection - COLLECTIONS being used to track who has what data. I guess this may also have the problem Signal faces with asynchronous cryptographic communication, which it solves with Double Ratchet Encryption.
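
A minimal sketch of merging two out-of-sync replicas of a collection, under my own assumptions about the idea above: the `crdt_cid` field is the proposed pointer, and treating the collection as a grow-only set keyed on CID (one of the simplest CRDTs) gives a convergent merge:

```typescript
interface ReplicaEvent {
  cid: string;
  crdt_cid?: string;   // proposed pointer used when reconciling replicas
  created_at: number;  // a hybrid/logical clock would be more robust
  payload: string;
}

// Grow-only-set merge: union the two replicas, dedupe by CID, then sort
// deterministically so every peer converges on the same sequence.
function mergeReplicas(a: ReplicaEvent[], b: ReplicaEvent[]): ReplicaEvent[] {
  const byCid = new Map<string, ReplicaEvent>();
  for (const ev of [...a, ...b]) byCid.set(ev.cid, ev);
  return [...byCid.values()].sort(
    (x, y) => x.created_at - y.created_at || x.cid.localeCompare(y.cid)
  );
}
```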

This is the second area where CGFS diverges from Kafka: if your event is not received by Kafka and placed in a topic, it does not exist; there is no going back and redoing it without playing back a new event stream. Kafka is an append-only system, just like a blockchain, and how Git is supposed to be used. The order of events is primarily how Kafka consumers function; CGFS does have ordered events for provenance's sake, but applications using CGFS are primarily supposed to use the index ID or queries over the data to make sense of it. Within CGFS, when we do have an event log, we are supposed to increment the ID of indexes being streamed into RxDB rather than rely on the provenance chain.
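
A sketch of that last point, with field names of my own invention: when replaying an event log into RxDB, assign a locally incrementing index ID so that consumers query over an index rather than walking the provenance chain one hop at a time:

```typescript
interface IndexedEvent {
  index_id: number; // monotonically increasing, like a Kafka offset
  cid: string;
  payload: string;
}

// Assign positions during replay; queries can then use index_id ranges
// (or any other index) instead of dereferencing prev_cid pointers.
function indexEventLog(events: { cid: string; payload: string }[]): IndexedEvent[] {
  return events.map((ev, i) => ({ index_id: i, ...ev }));
}
```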