Bitcoin Data Storage Model (BDSM)
Ok, so maybe the name can change... This is going to be a very high-level overview of an idea I have for on-chain data storage. I'll keep details sparse and assume technically minded folks can extrapolate the details, use cases, and impacts (mainly because I'm awful at writing docs). This idea is about 2 hours old as of writing, and I'm just going to brain-dump my thoughts as I go and publish it. God bless whoever reads this.
The basic idea is to explode a JSON object so that every key:value pair becomes a separate transaction. You would start with something similar to a JSON Schema definition, except each element would contain an additional property: a public bitcoin address. Every path in the JSON document then maps to a specific bitcoin address where only that path's values will be sent. This acts like a columnar-store database, allowing easy aggregation and searching. It also means anybody can create their own view of the data by defining a Schema Def that includes only the smaller set of fields they care about for their application or even web view (think search results vs. full detail page). Eventually, with libraries developed, you'd just point the library at the Schema Definition and pass it a query to fetch your custom data view. That's my high-level ramble of what I'm thinking, so if any of that sounds interesting you can proceed to the other sections.
- Easily indexed on any 'column' relevant to an application
- Easily subsettable
- Can listen to 'topics' (sections) you care about, similar to Kafka, by monitoring an address
- Less parsing, less data retrieval
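To make the schema idea above concrete, here is a minimal sketch of what such a definition and the "explode" step could look like. The field names, the `address` values, and the `explode` helper are all made up for illustration, not a proposed standard:

```python
# Hypothetical schema definition: a JSON-Schema-like structure where every
# leaf path also carries the public bitcoin address that all values for
# that path will be sent to. Addresses here are placeholder strings.
EMPLOYEE_SCHEMA = {
    "first_name": {"type": "string", "address": "addr_first_name"},
    "last_name":  {"type": "string", "address": "addr_last_name"},
    "sex":        {"type": "string", "address": "addr_sex"},
}

def explode(record, schema):
    """Explode a flat JSON record into (column_address, value) pairs,
    one pair per key, each destined to become its own transaction."""
    return [(schema[key]["address"], value) for key, value in record.items()]

pairs = explode({"first_name": "Ada", "sex": "female"}, EMPLOYEE_SCHEMA)
# pairs == [("addr_first_name", "Ada"), ("addr_sex", "female")]
```

A real implementation would need to handle nested paths (e.g. dotted path keys), but the core mapping from path to address is this simple.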
Defining a Record
Disclaimer: I'm a Software/Data Engineer and have a limited understanding of TXs, so some details might be off, but there is always a technical solution out there. This is just a starting point.
I had two ideas for this. I'll present my first idea and why I think it's bad, in case anyone else thinks of it. Might save some discussion.
- Create a transaction for each element in the object and hold the resulting TX id for each.
- Get an address from the Schema Definition that acts as a kind of "Record Registry".
- Send a TX to that address containing a JSON object with the TX ids from step 1 substituted in for the values.
Now, why I don't like this approach. At first it seems really clean, but then you realize all you are doing is kicking the can down the road; you might as well keep plain JSON, since you eventually have to go back to a JSON object, pull the entire thing, and parse it to reconstruct a cohesive record. This gives us no real benefit other than a possible minor optimization from dealing with a more compact record of pointers.
The better method is as follows, and it achieves the original goal of not having a centralized data point for the object. Implementation might be trickier and will rely on wallet developers to enforce it, since it's convention-based.
- Whenever a user goes to insert a new record, a new unique address must be used.
- That one-time address is the PK for all of the 'column'-level transactions.
- The record's identifier is that public address.
With the above approach we can now query strictly on the transactions without parsing any objects in the stored data. It also gives us another benefit: mutable/versioned records, if desired. Since the address is the PK, we get a versioned record where we just read the most recent 'column' TX from that address. When inserting a new record we generate a different address to serve as a new PK.
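The insert/update/read convention above can be sketched with an in-memory list standing in for the chain. Everything here is a simplification: `new_record_address` fakes what a wallet's key derivation would do, and the TX shape (`from`/`to`/`op_return`) is a made-up stand-in for real transaction structure:

```python
import secrets

SCHEMA = {"sex": {"address": "addr_sex"}, "salary": {"address": "addr_salary"}}
CHAIN = []  # stand-in for the blockchain: an append-only list of simplified TXs

def new_record_address():
    # Placeholder: a real wallet would derive a fresh keypair/address here.
    return "pk_" + secrets.token_hex(8)

def insert_record(record):
    """One TX per column, all sent from one fresh address (the record's PK)."""
    pk = new_record_address()
    for key, value in record.items():
        CHAIN.append({"from": pk, "to": SCHEMA[key]["address"], "op_return": value})
    return pk

def update_column(pk, key, value):
    """Versioned update: a later TX from the same PK supersedes earlier ones."""
    CHAIN.append({"from": pk, "to": SCHEMA[key]["address"], "op_return": value})

def read_column(pk, key):
    """Read the most recent value of one column for one record."""
    addr = SCHEMA[key]["address"]
    matches = [tx["op_return"] for tx in CHAIN
               if tx["from"] == pk and tx["to"] == addr]
    return matches[-1] if matches else None

pk = insert_record({"sex": "male", "salary": 50000})
update_column(pk, "salary", 55000)
# read_column(pk, "salary") now returns the newest version, 55000
```

Note how versioning falls out for free: no TX is ever modified, yet the record is effectively mutable because readers take the latest column TX per PK.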
On-Chain SQL-like Queries
This should now allow us to perform some SQL-like queries on chain. I say SQL-like because obviously this isn't just rows and columns; we potentially have nested objects. However, there are already standard ways to query JSON stored within databases, so just imagine a query language similar to that. For now I'll focus on a very flat schema to illustrate the performance benefit.
Let's say we have the typical Comp Sci Employee table everyone knows and loves. Before, if each employee were stored as a JSON object in a separate transaction and we wanted to find every male employee, we'd essentially have to do a table scan, in database terms. For the non-DB-focused readers, this means we'd need to read every single employee's full JSON object on chain, parse it, and check for sex == male.
Instead, we can now just go to our schema definition, take the address for the sex column, and query transactions at that address where op_return == 'male', and we are given a list of 'PKs' for the male employees. If all we wanted was a count, we are done here. If we needed more information, such as the first and last name, we would also pull the first- and last-name addresses and do a JOIN where the sending TX addresses (the PKs) match and the sex op_return == 'male'.
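The two-step query above can be sketched against a simplified view of the chain, where each TX is reduced to a `(record_pk, column_address, payload)` triple. The addresses, PKs, and employee rows below are invented sample data:

```python
# Simplified view of on-chain data: each TX is (record_pk, column_address, payload).
ADDR_SEX, ADDR_FIRST, ADDR_LAST = "addr_sex", "addr_first", "addr_last"

TXS = [
    ("pk1", ADDR_FIRST, "Alan"),  ("pk1", ADDR_LAST, "Turing"),   ("pk1", ADDR_SEX, "male"),
    ("pk2", ADDR_FIRST, "Grace"), ("pk2", ADDR_LAST, "Hopper"),   ("pk2", ADDR_SEX, "female"),
    ("pk3", ADDR_FIRST, "John"),  ("pk3", ADDR_LAST, "McCarthy"), ("pk3", ADDR_SEX, "male"),
]

def column(addr):
    """All pk -> value pairs written to a single column address."""
    return {pk: val for pk, col, val in TXS if col == addr}

# Step 1: touch only the sex column to find matching PKs (no JSON parsing).
male_pks = [pk for pk, v in column(ADDR_SEX).items() if v == "male"]

# Step 2: join against the name columns on the record PK.
first, last = column(ADDR_FIRST), column(ADDR_LAST)
males = sorted((first[pk], last[pk]) for pk in male_pks)
# males == [("Alan", "Turing"), ("John", "McCarthy")]
```

The point of the sketch: the filter in step 1 only ever reads the sex column's address, and step 2 reads exactly the two extra columns requested, never the full records.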
I just want to call out a few things here. You are limited in what you can do with completely on-chain queries; most likely that data would be treated as a data warehouse or data lake. Realistically, most applications will probably subscribe to 'topics' and ingest parts of that data into their own database for indexing purposes. This may be only a table of the PKs and the various columns of interest, so that each one can be indexed efficiently. Also, once ingested off chain, very large datasets can make use of compression and other features like we have in Apache Parquet, which wouldn't be possible with on-chain immutable data for obvious reasons. The point is, even where on-chain queries aren't feasible, there is a very real ingestion/processing gain from this format as well.
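The subscribe-and-ingest pattern described above might look something like this with SQLite as the application's local mirror. The table layout, the `ingest` helper, and the sample TXs are all hypothetical; the idea is just that each column TX arriving on a watched address becomes an upsert keyed on the record's PK:

```python
import sqlite3

# Sketch: an application subscribes to the column addresses ('topics') it
# cares about and mirrors those TXs into its own local, indexable table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (pk TEXT PRIMARY KEY, sex TEXT, salary INTEGER)")
db.execute("CREATE INDEX idx_sex ON employees (sex)")  # indexable off chain

def ingest(pk, column_name, value):
    """Upsert one column TX into the local mirror (SQLite UPSERT syntax).
    column_name is assumed to be validated against the schema definition."""
    db.execute(
        f"INSERT INTO employees (pk, {column_name}) VALUES (?, ?) "
        f"ON CONFLICT(pk) DO UPDATE SET {column_name} = excluded.{column_name}",
        (pk, value),
    )

# Simulated stream of column-level TXs from the watched addresses.
for tx in [("pk1", "sex", "male"), ("pk2", "sex", "female"),
           ("pk1", "salary", 60000), ("pk3", "sex", "male")]:
    ingest(*tx)

male_count = db.execute(
    "SELECT COUNT(*) FROM employees WHERE sex = 'male'").fetchone()[0]
# male_count == 2
```

Because each column arrives independently, the mirror fills in rows piecemeal, which fits the "only ingest the columns of interest" point above.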
Not actually a conclusion, but I just wanted to say there is plenty more that could be elaborated. Before going further down the rabbit hole and making a bigger wall of text to scare off readers, I think this is a good stopping point to open things up for discussion. Maybe the idea is terrible and I save myself some more typing. Sorry for the disorganization; I've always had trouble putting my ideas on paper. Going from abstract ideas to a document is hard, and it's typically easier for me to just write a PoC. Thanks for reading.