The Data Provisioning Layer for Decentralized AI
Grass is creating a new revenue stream for every single person with an internet connection
Welcome to the alpha please newsletter.
gm friends, I’m back with another interview today, and it’s a really exciting project, launching on mainnet in Q1/Q2.
It combines many different bullish verticals: DePIN + AI + Solana. If you have been following me for a while, you will have seen me talk about Grass, but today you get to hear from the Grass co-founder 0xdrej, and he dropped a lot of alpha. It’s a long read, but a really fascinating one.
Grass already has over 500k users. When the Grass network goes live, it will be one of the largest crypto protocols out there, just in terms of sheer user numbers.
We go through what Grass is, how it works, why they chose Solana and a lot more.
I will put an invite link to join the network here (just download the Chrome extension to earn Grass points) and at the end of the newsletter.
How did you get involved in crypto? What brought you to the space?
Yeah, I guess my early journey to crypto was one of many missed opportunities, as I guess it is for many others. I first heard about it in high school, because some guy in my class was mining it on his laptop, Bitcoin that is. I haven't heard from him since then, but I'm sure he's doing quite well. And I actually participated in a Doge faucet in 2014 when that first launched, but I lost access to that account. So I guess those are two big early experiences with crypto, but I didn't really get into any of the R&D until maybe a few years ago, when I started playing around with DeFi.
I spent a while working in finance, and was pretty familiar with all the traditional mechanics of how that industry works. And it was really exciting to watch a bunch of just like, normal people rebuilding an entire infrastructure on the blockchain. And you know, like, it's pretty crazy how many parallels you can draw from TradFi to anything happening on chain, even beyond DeFi, to be honest, just because of the fact that it's a massive immutable ledger. So yeah, I got involved in a few DeFi protocols a few years ago. While I was doing that, I came up with the idea for Wynd and I guess now we're here.
What’s the elevator pitch for Grass? How would you explain it at a high level?
So we like to call it the data provisioning layer for decentralised AI. What that really means is we have a network of now over half a million web extensions that are kind of just crawling the public internet, taking snapshots of websites and uploading them to a database.
The idea here is that because we can parallelize and distribute all of this computing power, as well as these residential views of the internet - which are important, because websites will generally show a residential user what they want consumers to see, as opposed to what they show a data centre or a traditional scraping product - we can actually create datasets that aren't really possible to create in other repos like Common Crawl, for instance.
So there have been a few comparisons. One is, you know, a decentralised oracle for AI; another is a decentralised version of Common Crawl. But yeah, at the end of the day, it's a massive data protocol that focuses on public web data.
So, by allowing anybody to participate in this network and integrating the blockchain, you’ve found that you can uniquely compete with existing solutions out there, correct?
So we experimented with a few different business models. Obviously, when you're building a protocol like this, you could just pay people for a little bit of unused bandwidth. You could give them a fixed rate per gigabyte, for example, and then use that bandwidth to scrape massive datasets, draw insights from them, and monetize those insights. From the scraping layer to the dataset layer to the insight layer, you capture a little bit of margin at each step.
Often, these steps are handled by different entities, and the user who provided the bandwidth that powers all of this only sees that tiny fixed rate per gigabyte, or often nothing at all, because an SDK bundled into some free app is just quietly cycling through their bandwidth. We don't think this is fair.
We thought, okay, how can we create a value pooling mechanism that compensates users for this entire vertical? So if someone's inferencing an AI model with data that your grass node scraped, your grass node should be compensated for that, not just for the raw data. Hopefully that makes sense. But that's one of the big things we wanted to solve on-chain.
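For illustration only, here is a back-of-the-envelope comparison of the two compensation models described above. All figures are hypothetical, not Grass's actual rates or revenue splits:

```python
# Hypothetical figures, purely illustrative -- not Grass's actual
# rates or revenue splits.
gb_contributed = 100

# Model A: fixed rate per gigabyte (the traditional bandwidth-sharing model)
fixed_rate_per_gb = 0.05  # dollars per GB, hypothetical
fixed_payout = gb_contributed * fixed_rate_per_gb

# Model B: value pooling across the whole vertical -- the node shares in
# revenue from every layer its bandwidth powered, not just the raw scrape.
scrape_revenue = 10.0    # raw data sales attributable to this node
dataset_revenue = 40.0   # curated dataset sales
insight_revenue = 150.0  # model inference / insight sales
node_share = 0.10        # fraction flowing back to the node, hypothetical

pooled_payout = node_share * (scrape_revenue + dataset_revenue + insight_revenue)

print(f"fixed: ${fixed_payout:.2f}, pooled: ${pooled_payout:.2f}")
```

The point is structural rather than about the specific numbers: under a fixed per-gigabyte rate the payout is capped regardless of what the data earns downstream, while value pooling scales the node's payout with the whole vertical.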
Another issue, which is becoming more and more prominent, is that of poisoned datasets. I don't know how familiar you are with this, as it's kind of an emerging problem, but it's existed in e-commerce for years.
For example, if you're scraping an e-commerce website like eBay, and you want to scrape the entirety of their inventory on a daily basis, you need to scrape around 30 million SKUs every day. eBay has learned that if they block your IP address, you'll just rotate addresses. So, what they do is they honeypot their prices. If they detect you're trying to scrape them and undercut them on pricing, they'll just give you fake prices. We experienced this in the early days when we were playing around with Grass and comparing it to using a data centre.
These e-commerce tactics have slowly flowed into ad tech. And then, since the explosion of LLMs in the last year and a half, it's actually flowed into the NLP [Natural Language Processing] dataset side of things too.
So, if you're a politician, and you know that a certain dataset is going to be used to train a model, you might reach out to the person curating that dataset and ask them to insert, say, a thousand sentences that favour a particular candidate. Similarly, companies are offering money to insert fake reviews into datasets that have already been scraped from the internet.
Now, solving this is very difficult, right? Because, as you might know, an LLM training dataset is not just gigabytes or terabytes, but petabytes of data, literally millions of gigabytes.
So, it's very unrealistic to expect anyone training an LLM to go and verify that the dataset actually came from the stated website. For example, if I claim I scraped the entirety of Medium, which would be something like 50 million articles, there's no guarantee that's actually the content that was in those Medium articles.
To solve this, zk-TLS (Zero-Knowledge Transport Layer Security) offers a great solution. This is something only possible on high-throughput blockchains, to be honest.
The idea is that once we are decentralised, these nodes submit a proof of request as they're scraping the internet. Our sequencers, which are currently centralised but which we plan on decentralising, would then delegate some amount of tokens to a smart contract.
This contract then unlocks as it receives the proof of request. Now, you can actually link that proof of request to the web response from that scraping job, and then tie that directly to the dataset. Suddenly, you have cryptographic proof that shows these rows in this dataset actually came from those websites, and were scraped at a specific date and time.
That's quite powerful, because such a mechanism doesn't even exist in Web 2.0, and it's only possible using a blockchain.
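The cryptography is beyond the scope of a newsletter, but the linking idea itself is simple to sketch. The snippet below is a hypothetical illustration, not Grass's actual zk-TLS implementation: it just commits to each scrape with hashes so a dataset row can later be checked against the response it supposedly came from. A real zk-TLS proof would additionally attest that the response came out of a TLS session with the named server.

```python
import hashlib


def proof_of_request(url: str, response_body: bytes, scraped_at: str) -> dict:
    """Commit to a scraping job by hashing the request target and the
    web response. Hypothetical sketch only -- a real zk-TLS proof would
    also cryptographically prove the response came from a TLS session
    with the named server, not just that the bytes match."""
    return {
        "url": url,
        "scraped_at": scraped_at,
        "request_hash": hashlib.sha256(url.encode()).hexdigest(),
        "response_hash": hashlib.sha256(response_body).hexdigest(),
    }


def row_matches_proof(row: bytes, proof: dict) -> bool:
    """Check that a dataset row is byte-identical to the committed response."""
    return hashlib.sha256(row).hexdigest() == proof["response_hash"]


# A node scrapes a page and publishes the proof alongside the data.
proof = proof_of_request(
    "https://example.com/post/1", b"the article text", "2024-01-15T12:00:00Z"
)

# Later, anyone auditing the dataset can verify a row's provenance,
# and a poisoned substitution fails the check.
print(row_matches_proof(b"the article text", proof))  # True
print(row_matches_proof(b"poisoned article", proof))  # False
```

This is also why poisoning becomes detectable: anyone who swaps in fake rows after the scrape can no longer match the committed hashes.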
Could you talk a little about what the “data wars” are and how Grass plays into it?
Definitely. So, as I alluded to earlier, the first industries to start walling off their data were actually e-commerce, as those were the most directly monetizable datasets at the time. As technology evolved, and as our understanding of language data became more advanced, this type of data became extremely valuable as well. The thing with language data, though, is that up until now, it hasn't provided as much value as it does currently. So, a lot of websites didn't really have ways to monetize this language data until recently. Then, they started realising how powerful this data is, and began walling off the internet.
For example, about half a year ago, Elon Musk started rate limiting Twitter for everyone because it was being scraped. Previously, Twitter didn't really stop web scrapers, but Elon Musk understood the value of Twitter data and wanted to train his own AI. This was something we predicted and it unfolded exactly like that.
Another example is Reddit, with all the restrictions they placed on their API. You might not know this, but two-thirds of the Common Crawl repository, which GPT was trained on, was actually scraped from Reddit.
Reddit didn't really understand how valuable their data was. It was particularly valuable because of the way the Reddit system works: someone asks a question, people answer, and the best answers get upvoted while the bad ones get downvoted. All of a sudden, you have a bunch of people manually training data that can go into a model.
We predicted a data war that is currently unfolding, where all these websites are trying to wall off their data. They're even adding backdoors for a very select few big tech companies, making AI inaccessible to the average open-source developer, which is a bit scary and poses a lot of centralisation risk.
Another great example is Medium. A few months ago, Medium's CEO wrote a blog post about web crawlers feeding Medium articles into AI models. He talked about poisoning those datasets, blocking crawlers, and making it as inaccessible as possible. That's why it's hard to browse Medium without making an account.
It's making the internet less usable for the average person, as companies try to wall off their data.
He also mentioned that they're letting Google access their data. The average person can't properly browse their website, but Google can crawl it to train their AI model for free. He explained why: Google will prioritise Medium in Google search in exchange for access. This shows how valuable it is to have a search engine, where you can pay for language data by prioritising SEO. That's the next big wave of these data wars.
All these companies are fighting over data, walling it off, trying to get the right price for something that's never been priced in human history. It's not fair that the average person is collateral, and that this data is only accessible to a few institutions.
The crazy thing is, there are incumbents right now scraping websites like Reddit by installing SDKs in millions of people's apps that they've downloaded for free. Say you download a Roku TV screensaver or some free mobile game. Developers are getting paid to put an SDK in there that allows these big corporations to use your bandwidth to scrape websites from your residential IP address because theirs is being blocked. The ironic part is we always agree to these terms and conditions; it's very easy to sneak them in there. Their justification, which is up for debate whether it's good or not, is, "Hey, you got an ad-free experience." So, they claim that's how you're being compensated. But we know very well that a few ads are far less valuable than the data being used.
Our philosophy with Grass is if there's going to be a data war, we might not be able to stop it, but we should at least have a chance to participate. We should have the option to sell the weapons in the data war or create a massive oasis of an open dataset for the internet that anyone can use to train their own AI models.
How easy is it for people to get involved with Grass and see some upside?
Right now the network is in beta testing, and it's quite simple. You have all the resources you need because the necessary hardware already exists on your device. All you need to do is get a referral code. Then you simply make an account, download the web extension or the Saga phone app, and you're up and running. That's all it really takes. The onboarding process is quite smooth.
One issue we've faced recently, though, is the number of users growing much faster than we ever anticipated. So, as we scale our infrastructure, people might face some small issues.
To help people grasp the scale of things, what would you say is the size of this market?
So we actually target two verticals at the moment, or I should say three, each with varying market sizes.
The first one is the alternative data industry, which I believe is a $20 billion market. By alternative data, I mostly mean data used by hedge funds. For example, if you're scraping prices and inventory at certain stores, you can estimate a company's quarterly earnings. Hedge funds will pay money for this kind of information.
The web scraping market on its own, while still emerging, is a few billion dollars at the moment, but it's growing massively. The reason for this massive growth is because of the third market, which is AI.
It's very difficult to put a number on the market size of AI data right now. It's one of those things that probably grows exponentially day to day, and it's hard for us to price. But when you look at some of the numbers people talk about when discussing selling data for AI datasets, it becomes apparent very quickly that this is a massive opportunity.
Does Grass then become more valuable and competitive the more users you have?
Yeah, that's a great question, actually. The network does become more valuable the larger it gets.
One analogy I can make is with Hivemapper, which I think is a really cool product and idea. If you want to map the entire world but only have, say, 10 cars driving around, you'll get just a tiny fraction of that map. It might be useful for some very specific, small-scale applications, but not very broadly useful.
However, if you have millions of drivers mapping every road in the world, you can paint a much more comprehensive picture. You can then sell a much better product for a much higher premium, and the unit economics improve drastically for everyone involved.
Now, if you think about it, Grass is essentially mapping the whole internet.
So, let me give you another example of an application that isn't AI-related, but is in a massive industry – airfare, travel, and hotels. If you're a travel aggregator website, you want to scrape the best prices from every single provider, from every location. For instance, the price of a flight from Berlin to Singapore might be different when viewed from New York compared to Berlin. Travel aggregators need to know the price of every flight from as many IP addresses in the world as possible to have the best product. Now, if they only have IP addresses in Singapore, China, and a few places in the US, and someone's trying to fly between two places in Europe, it would be very difficult for them to scrape the right prices. The network unlocks more use cases as it scales, which is exciting to see. We recently started observing this ourselves with the variety of data we can scrape.
As the network grows, do you think users' rewards will be diluted? Or will it find an equilibrium due to the network being more profitable?
I'll try to answer this without making any forward-looking statements, for obvious reasons. The first variable is that the network is very close to usable right now, which is why during this beta period, we've opted to compensate uptime. We don't plan on rewarding users for uptime indefinitely.
So, right now is the only time you can earn points just for keeping the device online. In the future, the node will only be compensated for actual bandwidth usage. Now, as you mentioned, regarding equilibrium, what I mentioned earlier about travel is a prime example.
In that sector, you can never have enough nodes; the most competitive travel aggregator is actually the one with access to the most of them. So, if you're able to unlock that, they'll only put more content, more throughput, through the network.
So now I'd like to move onto Solana specifically. What motivated your decision to build on Solana?
For what we're trying to do, having a high-throughput chain is obviously very important. When the Grass network goes live, it will be one of the largest crypto protocols out there, just in terms of sheer user numbers. This necessitates having very low gas fees to incentivize users. Solana is by far the most gas-efficient and is probably the fastest out there. Some of the upcoming updates, like Firedancer, are extremely exciting because parallel transactions are exactly what we need.
From a business development perspective, we'd love to partner with some of the other DePIN protocols out there, and most of that makes sense on Solana for the reasons I just mentioned. One thing we found really cool is that Solana has its own phone, and we believe that phone will only increase in adoption. That's something no other chain can really offer. It was kind of an obvious play for us to have an app on the Solana phone.
I’m also a big believer in the Saga phone and the DePIN narrative in general. Have you looked to any of the other players in the DePIN space, like Helium for instance, for inspiration?
Absolutely, yeah. The entire philosophy behind it really is about you, as a person living your daily life and paying for these services. Not only are you overpaying for a lot of things in your life, but you're also being robbed of things that you could be monetizing.
This recent push towards decentralisation, and some of the really cool things happening with Helium Mobile, for instance, and the Saga phone, is opening everyone's eyes. It's like, oh my goodness, I am sitting on so many resources that are, in many cases, just being stolen out from under me. We're all being wronged, and we've all been okay with it. But now, people are being shown another path where you have the option to not be okay with it. That's very powerful, and it's not something I would want to miss. So, we've certainly taken a lot of inspiration from that.
Looking to the future, what does 2024 look like for Grass? Can you give us some insights into your roadmap?
I think it won't come as a surprise to anyone to hear that we do plan on fully launching the network at some point in 2024.
Beyond that, in the roadmap, we want to implement the proof of request using zk-TLS, tying web requests to datasets, probably in the latter half of the year. We also plan to decentralise a lot of our sequencers. It's TBD how that will be implemented, but we have a lot of exciting ideas in store there that will allow people to have even more accessibility to running the base infrastructure of Grass.
One other thing we're playing around with is the idea of hardware. Right now, the cost of onboarding is zero, and we love that and plan to keep it that way forever. But let's say you don't want to keep your device online 24/7, or you don't want this node running on your device for whatever reason. We want to give people the option to just buy a box, hook it up to their internet, and have that run in the background. Beyond just personal preferences, one exciting aspect of having hardware is that we can actually put AI agents in that hardware and allow them to run inside it. They can do a lot of the web scraping and crawling for you. All you would have to do is sit back and let those AI agents run these jobs, almost like having a self-driving car that maps things out.
We’d like people to have options. If you want to contribute more to the network, then we want there to be a device available that is capable of doing that.
There are some small things we're working on, like new gamification aspects for the dashboard. We also want to add a few Easter egg features specifically for Saga users and are exploring ideas in that space. In addition to that, we're looking at distributions for other devices. Right now, it's not just about the web extension; we're considering making it downloadable for those who want it. There are a lot of people, for instance, who don't like to install extensions, and that's totally fine. So, we're planning to expand to other platforms like Android, iOS, Raspberry Pi, Linux, and so on.
Essentially, we want to give people more options to be able to join the network.
How do you see the governance structure of Grass? Will it be a fully decentralised network owned by the community?
We have a few different stages towards decentralisation. The first one is the certification mechanism, where we're capable of rewarding users for their contributions on-chain.
The second stage involves the decentralisation of our sequencers, as well as some of the scraping approval request stuff. Governance plays a key role here. We essentially want to become this massive data provisioning network where community members can say, "Hey, I'm training this AI model, I need these types of datasets, I would like to propose that we switch our scraping efforts to scrape those datasets." The sequencers can then double as validators to ensure that the right data is being scraped.
One of the few governance features we want to include is protecting the network. In a decentralised network, you generally achieve market efficiency over time if executed properly. There are many apps out there offering to monetize your unused CPU, GPU, etc., often dealing in fiat. They might start off paying a certain rate to onboard members and then decrease that rate over time to a point where earnings become very minimal. The app just runs in the background and becomes part of the problem it was meant to solve in the first place.
With a governance structure, you protect the community because those who contribute to the network actually own a piece of the network. This is the state we want to reach where everyone running nodes in the Grass network owns their piece of the network itself.
Do you think you have enough scale to theoretically launch the network now? Or do you still want to grow the number of nodes before launch?
In terms of the overall number of nodes, we're very close to our target. However, in specific geographies, we're actually not that close. There are certain geographies where people want to scrape specific types of content, and the demand there is actually higher than the supply. We want to make sure that we're capable of meeting all the demand and capturing an unbiased view of the internet from as many nodes as possible. That's our goal for launching the network.
As you know, we're in beta, so we're working our best to ensure the network is scalable. As we've grown much faster than anticipated, people have been facing some issues with onboarding or with their dashboard display. These are all kinks that we plan on ironing out prior to the full network launch. That's why we're still in beta. So, there are a number of factors we're considering in terms of the number of nodes. Overall, we're pretty happy with where we are though.
And that’s your alpha.
Thanks for reading alpha please! Subscribe for free to receive new posts and support my work.
Not financial or tax advice. This newsletter is strictly educational and is not investment advice or a solicitation to buy or sell any assets or to make any financial decisions. Cryptocurrencies are very risky assets and you can lose all of your money. Do your own research.