
ETL to QE, Update 27, Meme Schema Roadmap to Implementation

2024-02-15

As a reminder the plan to MVP is,

TLDR; Memes, Schema, Tokens, Merkle Trees

  1. Memes = CGFS Collection - MEMELET_MODEL
  2. Schema = CGFS Schema - Persona
  3. Tokens = QE - Token Specification
  4. Merkle Trees - QE - Proof of Meme

Okay so what is the ELI5 for the CGFS Collection - MEMELET_MODEL

All messages have the same characteristics,

  • Mentions
  • Reply to message
  • To Identities
  • Time Sent
  • Title
  • Content Type
  • Content
  • Raw Tagging (Hashtags)
  • Reactions
  • Read Receipts

Then there are complex message features such as,

Individual memes are atomic but memelets are not. A version controlled note, like this one, must be treated as a memelet rather than a meme.

The content must be separate from the timestamp.

The mentions must be stored as DIDs

Not all memes have a title, but some messaging systems require one.

Can the title be stored in the metadata?

Yes.

Can the timestamp be stored in the metadata?

Yes.

Do we separate out the required metadata from the operational metadata?

The timestamp is part of the identity, that is Phase 2.

Alright, a meme requires 3 components: a type, its content, and metadata. THAT IS IT, everything else is optional.

Good now we need JSONSchema for that.

Ah, we need one more thing: a version number, so we know what it is.

We can encode a version number using the CID identifier

No we need it in the JSON

Do we just reserve the root key dd?

Sure, so we now have 4 things, a type, a version number, content, and a key value store.

Yes that sounds about right.

What keys do we use for these items?

  • QEVersion : String
  • type : Object
    • name : String
    • version : String
    • JSONSchema : JSON
  • content : String
  • data : Object

What does the JSONSchema look like?

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CustomSchema",
  "type": "object",
  "properties": {
    "QEVersion": {
      "type": "string"
    },
    "type": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string"
        },
        "version": {
          "type": "string"
        },
        "JSONSchema": {
          "type": "object",
          "additionalProperties": true
        }
      },
      "required": ["name", "version", "JSONSchema"]
    },
    "content": {
      "type": "string"
    },
    "data": {
      "type": "object",
      "additionalProperties": true
    }
  },
  "required": ["QEVersion", "type", "content", "data"]
}

Source
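To make the four root keys concrete, here is a minimal sketch of one meme plus a hand-rolled check of the required keys. It is a stand-in for a real JSON Schema validator, not a replacement for one, and every example value (version strings, the note type, the title and timestamp inside data) is a made-up assumption. Only the key names come from the schema above.

```python
import json

# Hand-rolled check of the required keys; a stand-in for a real JSON Schema
# validator, which would be the better tool in practice.
REQUIRED_ROOT = {"QEVersion": str, "type": dict, "content": str, "data": dict}
REQUIRED_TYPE = {"name": str, "version": str, "JSONSchema": dict}


def check_meme(meme: dict) -> list[str]:
    """Return a list of problems; an empty list means the meme passes."""
    problems = []
    for key, expected in REQUIRED_ROOT.items():
        if key not in meme:
            problems.append(f"missing root key: {key}")
        elif not isinstance(meme[key], expected):
            problems.append(f"wrong type for root key: {key}")
    type_block = meme.get("type")
    if isinstance(type_block, dict):
        for key, expected in REQUIRED_TYPE.items():
            if key not in type_block:
                problems.append(f"missing type key: {key}")
            elif not isinstance(type_block[key], expected):
                problems.append(f"wrong type for type key: {key}")
    return problems


# Hypothetical example values; only the key names come from the schema.
example_meme = {
    "QEVersion": "0.0.1",
    "type": {"name": "note", "version": "0.0.1", "JSONSchema": {"type": "string"}},
    "content": "Hello QE",
    "data": {"title": "First meme", "timestamp": "2024-02-15T00:00:00Z"},
}

print(json.dumps(example_meme))
print(check_meme(example_meme))
```

Note how the title and timestamp live inside data, matching the answers given earlier about storing them in the metadata.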

Nice, now we need to standardize the other data. Let's start with the DIDs.


did:dd:twitter:$USERNAME

did:dd:keybase:$USERNAME

did:dd:discord:$USERNAME:$USER_ID

did:dd:facebook:$PROFILE_ID

did:dd:instagram:$PROFILE_ID

did:dd:matrix:$MATRIX_ID

did:dd:nostr:$NOSTR_ID

did:dd:apub:$ACTIVITY_PUB_ID

did:dd:email:$EMAIL_ADDRESS

did:dd:hypothesis:$HYPOTHESIS_USERNAME

did:dd:phone:$PHONE_NUMBER_WITH_AREA_CODE
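A small sketch of how these identifiers could be built and parsed. The helper names make_did and parse_did are hypothetical, and the parsing rule (everything after the platform is the identifier) is my assumption, chosen so that identifiers which contain colons themselves, like the Discord form or a Matrix ID, stay in one piece.

```python
PREFIX = "did:dd"


def make_did(platform: str, *parts: str) -> str:
    """Join the platform and its identifier parts into a did:dd string."""
    if not platform or not parts or any(p == "" for p in parts):
        raise ValueError("platform and identifier parts must be non-empty")
    return ":".join([PREFIX, platform, *parts])


def parse_did(did: str) -> tuple[str, str]:
    """Split a did:dd string into (platform, identifier).

    Only the first three colons are treated as separators, so colons inside
    the identifier itself are preserved.
    """
    pieces = did.split(":", 3)
    if len(pieces) != 4 or pieces[0] != "did" or pieces[1] != "dd":
        raise ValueError(f"not a did:dd identifier: {did}")
    if not pieces[2] or not pieces[3]:
        raise ValueError(f"empty platform or identifier: {did}")
    return pieces[2], pieces[3]


print(make_did("discord", "alice", "1234"))
print(parse_did("did:dd:matrix:@alice:example.org"))
```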

I just took a look inside the data Facebook gives you when you request it. Turns out it does not include a link to the user's profile. Therefore, in order to index everything, we will require IPNS names.

Is requiring a private key for each identity a good idea?

It's either use a UUID or use an entire private key.

Ya I like using an entire private key.

So every person I would index would get their own private key just to index them.

Ya that sounds about right.

Okay, now we need to come up with our types.

I think the solution is to ETL each message type.

What do you mean?

Find better examples of each message type, then get a JSONSchema for each.

That sounds like a good idea.

Then we need to define how the JSON gets transformed.

Ask ChatGPT / search: "Can jq transform data?" Result: "How To Transform JSON Data with jq" (DigitalOcean)

Okay seems like jq will be all we need to get the data from whatever JSON we start with to the standardized format we like.

Alright, now that we know where we want to be, we need to articulate the path to get there.

We want to put all my messages in the same NDJSON meme format.

The raw messages will require a JSONSchema for their traditional message type, then a series of jq transformations in order to get it into what we need.
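The plan is to do this step with jq. Purely to illustrate the shape of the transformation, here is the same idea sketched in Python instead. The raw record fields (sender, text, sent_at), the version strings, and the keys placed inside data are all hypothetical, since real exports differ per platform.

```python
import json

# Hypothetical raw export records; real exports differ per platform.
raw_messages = [
    {"sender": "alice", "text": "gm", "sent_at": "2024-02-15T09:00:00Z"},
    {"sender": "bob", "text": "gm back", "sent_at": "2024-02-15T09:01:00Z"},
]


def to_meme(raw: dict, platform: str) -> dict:
    """Map one raw message onto the four root keys of the meme format."""
    return {
        "QEVersion": "0.0.1",  # placeholder version string
        "type": {"name": "message", "version": "0.0.1", "JSONSchema": {}},
        "content": raw["text"],
        "data": {
            "from": f"did:dd:{platform}:{raw['sender']}",
            "timestamp": raw["sent_at"],
        },
    }


def to_ndjson(memes: list[dict]) -> str:
    """Serialize one meme per line, which is all NDJSON requires."""
    return "".join(json.dumps(m, separators=(",", ":")) + "\n" for m in memes)


ndjson = to_ndjson([to_meme(m, "twitter") for m in raw_messages])
print(ndjson, end="")
```

Each platform would get its own version of to_meme (or its own jq filter), while the NDJSON writer stays the same for all of them.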

So does this mean we stop treating my personal data archive as a bunch of files I access via mega-cmd?

That's a good question. The end goal here is to replace MEGA with a file system that supports hash-based identifiers, path mounting via FUSE, encrypted sharing, shared folders, real time updates, etc. etc. etc.

Okay, so either writing an entire file system or upgrading/using Perkeep or Mintter.

Actually IPNS and CIDs might be all we need, except for the pubsub. A CAR file is a copy of what gets mounted to your file system, right?

Okay what is the use case here?

I want to share a presentation that I edit with other people.

That's just not happening with Libre Office. The file gets saved and passed around.

What about sharing an ebook library?

The ebook library should be DAG-JSON and link to CIDs. When someone does not have the specific book they request it from other people on the network.

Ah so we just need to resolve every CID in the DAG-JSON.

But that's not writing out to the file system.

That's just an implementation problem. A simple script that reads every author+title can then output to the file system.
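A minimal sketch of what that simple script might look like. The {"/": "<cid>"} shape is how DAG-JSON encodes a link, but everything else here is an assumption: the library layout (books, author, title, file), the placeholder CID strings, and the plain dict standing in for whatever block store or network request would actually supply the bytes.

```python
import tempfile
from pathlib import Path

# Hypothetical library document. The {"/": "<cid>"} shape is a DAG-JSON link;
# the CID strings are placeholders, not real CIDs.
library = {
    "books": [
        {"author": "Mary Shelley", "title": "Frankenstein",
         "file": {"/": "cid-frankenstein"}},
        {"author": "Bram Stoker", "title": "Dracula",
         "file": {"/": "cid-dracula"}},
    ]
}

# Stand-in for the network: only the blocks we already hold locally.
local_blocks = {"cid-frankenstein": b"It was on a dreary night of November..."}


def export_library(library: dict, blocks: dict, out_dir: str) -> list[str]:
    """Write held books to out_dir/<author>/<title>; return CIDs still missing."""
    missing = []
    for book in library["books"]:
        cid = book["file"]["/"]
        if cid not in blocks:
            missing.append(cid)  # these would be requested from peers
            continue
        target = Path(out_dir) / book["author"] / book["title"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(blocks[cid])
    return missing


out_dir = tempfile.mkdtemp()
missing = export_library(library, local_blocks, out_dir)
print(missing)
```

The returned list is the "request it from other people on the network" part: anything in it is a book the catalog knows about but we do not hold yet.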

What about my photo library?

That should be DAG-JSON as well except it should be chunked just like large chat logs.

What about my facebook exports and stuff?

Each export should be a CAR file and all the CAR files should be managed using DAG-JSON.

The key thing to understand here is that we want to now refer to the same file by its hash in multiple places and we can't currently do that.

Alright, replacing mega-cmd is out of scope; that can be resolved in the next step, Schema = CGFS Schema - Persona. Oh, also, if I just dump CIDs to the Linux file system I can just symlink them where I want them whahahaha.
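The symlink trick can be sketched in a few lines. This assumes a flat directory of files named by hash, and it uses a plain SHA-256 hex digest as a stand-in for a real CID (a real CID also encodes the codec and multihash, which this skips). The paths and helper names are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_by_hash(store_dir: Path, data: bytes) -> Path:
    """Write data once under its SHA-256 hex digest (a stand-in for a CID)."""
    digest = hashlib.sha256(data).hexdigest()
    path = store_dir / digest
    if not path.exists():
        path.write_bytes(data)
    return path


def link_into(stored: Path, link_path: Path) -> None:
    """Expose the stored file under a human-friendly path via a symlink."""
    link_path.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(stored, link_path)


root = Path(tempfile.mkdtemp())
store = root / "cids"
store.mkdir()

# One copy of the bytes, referenced from two places.
stored = store_by_hash(store, b"slides v1")
link_into(stored, root / "presentations" / "talk.odp")
link_into(stored, root / "shared" / "talk.odp")
print(os.path.realpath(root / "shared" / "talk.odp"))
```

This is exactly the "same file by its hash in multiple places" property: both friendly paths resolve to the single file in the store.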

Alright let's take a look at where we want to be? Catechism - CGFS Meme Model

Nice, now what are the steps to get there?

  1. Get better demo JSON Data for each of these platforms.
  2. Define a JSONSchema for the OG data.
  3. Save all these JSONSchemas somewhere and refer to them via CID for now.
  4. Write ETL scripts for each data using jq
    1. WRITE DATA GENERATORS
    2. GET EXAMPLE DATA
    3. WRITE TESTS
  5. Reindex all the data I posted above without the need for cryptographic DIDs
    1. Check if we can use a Helia datastore, check out What is that IPFS S3 JS javascript Library?
  6. Upgrade JSONSchema to use Cryptographic DIDs rather than raw CIDs.

There is one part of this plan that does not sit right with me, and that is the possible use of the Mutable File System (MFS) and IPFS UnixFS. These abstractions are separate from Multiformats and may not be compatible with other programming languages. We assume with confidence that DAG-JSON and friends are a solid base to build up from. This MFS and UnixFS stuff doesn't seem as well supported.

Can IPFS UnixFS in Javascript be exported to a CAR file?

Okay, so UnixFS is fine and actually portable, because CAR files can be extracted and dumped onto an actual Unix file system.

What is the difference between IPFS MFS and UnixFS?

Alright so MFS solves the problem of updating data. MFS updates the file system irrespective of Identity and context preservation.

Actually let's test exporting it using CAR.

Aaaand a couple hours later what do we got?

Seems like IPFS IPLD CID Tutorial and How to use Firebase to host IPFS static site? is going to have to be updated with UnixFS.

Okay let's define a goal here.

So CGFS Schema - Persona is going to be a series of IndexedDB key-value stores. There do exist the IPFS Stores.

Should we care about UnixFS in the browser after this and this happened?

What did we want to do with unixfs anyways?

I want to send users CAR files rather than trying to sync leveldb.

CAR files have a problem because they can't be chunked or easily loaded into a browser. We don't want to pull a memex.garden with the glitchy sync client.

Plus the entire idea of CAR files does not work with S3. It may work for Filebase providing a path for each file within the CAR file, but definitely not for S3, where the CAR file would have to be extracted first. What we learned here is that CAR files are great in a Unix environment but suck in the browser.

UnixFS doesn't even work for the complex file permissions we want to add to things.

The best thing to do would be to open a web socket and have a conversation asking for everything a DID at a time, treating other people's levelDB as a file system. Not with SQL or something just raw levelDB.

Can DAG-JSON actually resolve using LevelDB?

Nope that's up to us to resolve that, that seems like code I can write. It should be simple once the user journey for all the data is defined.
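A rough sketch of what that resolver could look like. A plain dict stands in for LevelDB here, on the assumption that the store is just a mapping from CID string to a JSON-encoded node. The CIDs are placeholders and the function names are hypothetical.

```python
import json


def is_link(value) -> bool:
    """A DAG-JSON link is an object whose only key is "/" with a string CID."""
    return (
        isinstance(value, dict)
        and set(value) == {"/"}
        and isinstance(value["/"], str)
    )


def resolve(node, store, seen=()):
    """Recursively replace links with the nodes they point to.

    `store` is any mapping from CID string to a JSON-encoded node. CIDs that
    are missing from the store are left in place as links.
    """
    if is_link(node):
        cid = node["/"]
        if cid in seen or cid not in store:
            return node
        return resolve(json.loads(store[cid]), store, seen + (cid,))
    if isinstance(node, dict):
        return {key: resolve(value, store, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, store, seen) for value in node]
    return node


# Placeholder CIDs; "cid-m2" is deliberately missing from the store.
store = {
    "cid-root": '{"name":"chat","messages":[{"/":"cid-m1"},{"/":"cid-m2"}]}',
    "cid-m1": '{"content":"gm"}',
}
resolved = resolve({"/": "cid-root"}, store)
print(json.dumps(resolved))
```

Links left unresolved in the output are exactly the things to ask a peer for over the web socket.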

Okay what now?

Next is get better examples of the JSON for each social media platform stuff in a git repo and explain how to transform it.

Also for Raindrop which is CSV we should just use sqlite or maybe duckdb. SQLite is definitely far more universal, I can just use a python script that uses native python.
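A minimal sketch of the native Python route, using only the csv and sqlite3 modules from the standard library. The column names (title, url, tags, created) and the sample rows are assumptions for illustration, not Raindrop's documented export format, so check them against a real export first.

```python
import csv
import io
import sqlite3

# Hypothetical sample; the column names are assumptions, not Raindrop's
# documented export format.
sample_csv = """title,url,tags,created
IPLD notes,https://example.com/ipld,"ipfs, ipld",2024-02-10T12:00:00Z
jq notes,https://example.com/jq,json,2024-02-11T08:30:00Z
"""


def load_bookmarks(csv_text: str, conn: sqlite3.Connection) -> int:
    """Create the bookmarks table, insert every CSV row, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks "
        "(title TEXT, url TEXT, tags TEXT, created TEXT)"
    )
    rows = [
        (row["title"], row["url"], row["tags"], row["created"])
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO bookmarks VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_bookmarks(sample_csv, conn)
tagged = conn.execute(
    "SELECT title FROM bookmarks WHERE tags LIKE ?", ("%ipld%",)
).fetchall()
print(count, tagged)
```

Swapping ":memory:" for a file path gives a database file that any SQLite client can open.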

CAR files are like ZIP and TAR files: they are for archiving, not everyday use. Also, one can publish a CAR file to S3 by just extracting it and copying the directory over to S3.

Okay that's solved so what exactly is the problem?

But don't we want to do all the message format research in Obsidian and not in the repo itself?

Yes, what is the purpose of the repo then?

It is supposed to have the ETL procedures, the recursive jq commands to get the data into the required format.

Yes that's what we want.

Alright let the research project begin.

What about the previous research projects?

Alright all these projects each need their Heilmeier Catechism. They should not truly begin until that is complete in at least some form that you are willing to put on your public blog and point other people to.

Okay I like that requirement. And we have already started that,

Catechism - CGFS Meme Model

How to export an IndexedDB LevelDB database from an Electron App or website or extension?

That is way harder than it is supposed to be. It would literally be easier to hook up to a localhost daemon via websockets.

How to generate a PGP key with a seed in javascript?

Can't Read Chromium IndexedDB

Wasted 6 hours looking into LevelDB. What did we learn?

  • Chrome uses a special version of LevelDB that you can't read without compiling it from Chrome itself or loading up chrome and trying to export the data
  • There is no setting you can load into Chrome or Brave to force allowing loading scripts into sites see Chromium --allow-running-insecure-content
  • The best option is to load up Puppeteer or Electron with the appropriate settings to load custom JavaScript into Chromium and then save the data as JSON
    • Download JSON Files
    • Read Javascript directly from console
    • Load into some server
  • Firefox stores their IndexedDB stuff in SQLite and it is hard to make sense of.

PGP in Javascript

We wasted many hours trying to figure out how to generate PGP keys in javascript.

Cool we have partially solved our PGP in JS problem and IPNS keygen problem at the same time.

Missing User Journey

Alright I feel like we are still missing a user journey......

Well, we want to develop our own Nostr client but be able to play with the data client side: Obsidian, SQL, drag and drop, file system style.

How would your mother use this?

My mother emails me stuff and keeps notes in her email. The stuff she would normally email me she is supposed to send me via QE. Then she and I can integrate all those messages into a wiki with context.

How are my friends supposed to use this?

Every person in the group chat now has a wiki page to themselves as well as a collaborative wiki.

This wiki part is not in your pitch deck from the other day.

Yes that will need to be fixed, added to the backlog.

Okay so what can we get done this weekend?

Develop my own Nostr client with React.

Support,

  • Event 0
    • Pseudonym / Username
    • Description
    • Bitcoin Lightning Address
  • Pseudonym / Username
  • Description
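The weekend client is meant to be React, but the data handling for Event 0 is small enough to sketch in Python. In Nostr a kind 0 event carries its profile as a stringified JSON object in content, with name and about as the usual fields. Reading the Lightning address from lud16 is my assumption about which field the client should use. The event below is made up, with id, pubkey and sig left out for brevity.

```python
import json

# Hypothetical kind 0 event; id, pubkey and sig are omitted for brevity and
# the profile values are made up.
event = {
    "kind": 0,
    "created_at": 1707955200,
    "tags": [],
    "content": json.dumps(
        {"name": "alice", "about": "ETL to QE", "lud16": "alice@example.com"}
    ),
}


def parse_profile(event: dict) -> dict:
    """Extract the three fields the client needs from a kind 0 event."""
    if event.get("kind") != 0:
        raise ValueError("not a kind 0 metadata event")
    profile = json.loads(event["content"])
    return {
        "username": profile.get("name"),
        "description": profile.get("about"),
        "lightning_address": profile.get("lud16"),  # assumed field name
    }


print(parse_profile(event))
```

Missing fields come back as None rather than raising, since profiles in the wild are often incomplete.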

What else did we learn about?

Interrogate CAR Files

So manually creating a CAR file in browser even with all the examples was a bitch and a half.

I was thinking that CAR files could basically be mounted in an IndexedDB namespace. Then I could manipulate them with the same old mv, cp, ls commands I am used to.

CAR Files versus Custom IPLD Structure.

So turns out that CAR files are just IPLD protobuf storing binary blobs at the end attached to file names.
