
Luel
Turning everyday words and actions into usable training data.
$9.6B
AI dataset TAM by 2029
~27.7% CAGR · multimodal fastest-growing
$29.2B
Data collection & labelling by 2032
~28.5% CAGR (Allied Market Research)
Days
To-spec delivery target
Custom multimodal with audit-ready artifacts
Thesis
- 01
Rights-cleared provenance is becoming a procurement prerequisite. The NYT suit against OpenAI, the wave of publisher licensing deals (Stack Overflow, Reddit, Axel Springer), and the EU AI Act have collectively turned "where did this data come from" into a box the buyer must check before signing.[3] [17] [18] [19] [8] Luel's bundle of consent evidence, chain-of-title, and QA logs is the artifact buyers need to check that box.
- 02
Multimodal and robotics models need data the web cannot supply. Web-scale corpora are insufficient for state-of-the-art robotics, egocentric video, and specialist dialogue. Google's RT-1 work is a public example of the alternative: large, purpose-collected teleoperation datasets gathered by hand.[5] Luel's edge-case focus — niche languages, robotics POV, patient-doctor conversations — targets exactly the high-pain, high-value gaps.
- 03
Marketplace plus re-licensable catalog economics compound. Owning non-exclusive rights to a collection lets Luel re-license it at high gross margins after collection costs are amortized — closer to stock-media dynamics, but with AI-grade QA and metadata. DefinedCrowd raised $65M building a less rights-focused version of this thesis.[14] [15]
- 04
A concentrated buyer set lets the winner scale fast. Frontier labs are few but spend heavily: Scale's $1B round at a $13.8B valuation and Surge's reported effort to raise up to $1B at a valuation above $15B are the market signal.[12] [13] Becoming the default rights-cleared partner for time-sensitive, compliance-gated budgets is a category-defining outcome.
Problem
The web has been scraped. Now labs need data the web never had.
Frontier AI labs need rights-cleared multimodal training data at scale, and the most accessible source — the public web — has been substantially exhausted for the modalities that matter most.[2] Most available datasets fail production requirements: unclear rights, weak provenance, missing consent, inconsistent metadata.[1] The shortcuts that worked five years ago no longer work.
At the same time, the cost of using shortcuts has gone up. The NYT v. OpenAI suit, the YouTube transcript controversy, and the EU AI Act have collectively pushed legal, brand, and procurement teams to demand explicit licensing and audit-ready documentation before a dataset can enter pre-training.[3] [4] [8]
The remaining option is to commission real-world data: egocentric video from licensed contributors, professional dialogues with consent, robotics teleoperation footage shot to spec. That is operationally hard. It involves recruiting and vetting contributors, capturing consent, running multi-stage QA, and shipping artifacts the procurement team can sign off on. Most labs don't want to build the operation in-house, and the vendors that have built it (Scale, Surge, Mercor) are not optimized for rights-cleared catalog re-licensing.
Up to $1B
Surge AI raise sought (reported)
>$15B valuation · frontier lab dependency
$1B
Scale AI's 2024 round
$13.8B valuation · category gravity
$65M+
DefinedCrowd total raise
Closest rights-cleared marketplace analog
Why Now
Four forces hit at once. Rights-cleared multimodal data is the convergence trade.
Exhaustion of public corpora, regulatory hardening, the rise of multimodal / robotics, and the maturation of contributor-marketplace operations all arrive in the same 24-month window.
Public data ran out and procurement teams stopped looking the other way.
Data exhaustion is no longer hypothetical. Epoch AI's projections treat high-quality public text as a finite stock, with the constraint tightening as training compute scales.[2] Multimodal demand makes the gap larger, not smaller: real-world video, audio, and robotics footage was never on the web in usable quantities to begin with.
Lawsuits and licensing deals reshaped the buyer's risk model. The NYT v. OpenAI litigation, OpenAI's deal with Stack Overflow, Google's licensing arrangement with Reddit, and Axel Springer's partnership with OpenAI together signal that the era of "scrape now, apologize later" is closing.[3] [17] [18] [19] Buyers now want a paper trail.
The EU AI Act hardens the requirement. Final approval in May 2024 codified provenance and consent expectations for high-risk modalities — health, biometric, PII — and effectively exports those expectations globally for any lab serving European customers.[8]
Multimodal and robotics need bespoke collection. State-of-the-art robotics systems like RT-1 rely on hand-collected teleoperation datasets, not scrapes.[5] RLHF practice has converged on the conclusion that high-quality, diverse human data, not sheer volume, is the bottleneck.[6]
How It Works
Spec in. Audit-ready dataset out. In days, not quarters.
Two product motions on one operational substrate.
Bespoke collection. The high-margin, high-touch motion: a frontier lab needs gemstone manufacturing footage, or patient-doctor dialogues, or egocentric data from cooks in a commercial kitchen. Luel turns the request around in days with the full paper trail. Customers pay for speed and for the procurement-ready artifacts.[1]
Off-the-shelf catalog. The compounding motion: collections that were funded by a bespoke job become re-licensable inventory. The customer gets a fast start; Luel gets a margin profile closer to stock-media than to services. Inigo's published Ego-Realm dataset on Hugging Face is an early demonstration of the catalog motion.
Compliance is the product, not a wrapper. Standard license templates, usage scopes, consent revocation flows, and documentation aligned to EU AI Act risk categories are not features bolted on — they are why the buyer chooses Luel.[8]
Interoperability. APIs, SDKs, and delivery formats are designed to drop into the lab's existing pre-training, fine-tuning, and evals pipelines so the dataset doesn't sit in a slow review queue.
The artifacts that ship with every dataset
Paper trail
The bundle that lets a frontier lab clear procurement, legal, and brand review without a multi-month back-and-forth.
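As a rough illustration of what such a bundle could look like in machine-readable form, the sketch below models the artifacts named in this memo (consent evidence, chain-of-title, QA logs, license scope) as a simple record with a completeness check. The class name, field names, and identifiers are hypothetical and are not Luel's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArtifactBundle:
    """Hypothetical record of the paper trail shipped with one dataset."""
    dataset_id: str
    consent_records: List[str] = field(default_factory=list)  # IDs of consent evidence
    chain_of_title: List[str] = field(default_factory=list)   # IDs of rights-transfer documents
    qa_logs: List[str] = field(default_factory=list)          # IDs of QA stage reports
    license_scope: str = ""                                   # permitted usage, as free text


# Fields a reviewer would expect to be non-empty (an assumption for this sketch).
REQUIRED_FIELDS = ("consent_records", "chain_of_title", "qa_logs", "license_scope")


def missing_artifacts(bundle: ArtifactBundle) -> List[str]:
    """Return the names of required fields that are still empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(bundle, name)]


# Example: a bundle with consent evidence and QA logs but no title chain or license scope.
bundle = ArtifactBundle(
    dataset_id="demo-001",
    consent_records=["consent-0001"],
    qa_logs=["qa-stage-1"],
)
print(missing_artifacts(bundle))  # ['chain_of_title', 'license_scope']
```

A check of this kind is one way a buyer's intake pipeline could reject an incomplete delivery before it reaches legal review.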
Market
A market that is small today and inevitable tomorrow.
AI training datasets and the broader data collection & labelling services market are both compounding at roughly 27–28% annually — multimodal is the fastest-growing segment.
The training dataset market is tripling inside the next five years.
The AI training datasets market sits at roughly $2.82B in 2024 and is projected to reach ~$9.58B by 2029 at ~27.7% CAGR, with multimodal as the fastest-growing segment.[9]
The broader data collection and labelling services market is roughly $3.0B in 2023 heading to ~$29.2B by 2032 at ~28.5% CAGR.[10] Publisher and platform licensing deals — Stack Overflow, Reddit, Axel Springer — validate willingness to pay for rights-cleared content at the upper end of that range.[17] [18] [19]
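The growth rates quoted above can be sanity-checked from the endpoint figures. The sketch below computes the compound annual growth rate implied by each start and end value, assuming the number of compounding years is simply the difference between the two calendar years (an assumption; the source reports may define their forecast periods differently).

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures in $B, taken from the text; the year spans are an assumption of this sketch.
datasets_cagr = implied_cagr(2.82, 9.58, 2029 - 2024)   # AI training datasets
labelling_cagr = implied_cagr(3.0, 29.2, 2032 - 2023)   # data collection and labelling

print(f"AI training datasets: {datasets_cagr:.1%}")
print(f"Data collection and labelling: {labelling_cagr:.1%}")
```

Under these assumptions the first pair reproduces roughly 27.7%, while the second comes out nearer 28.8% than the quoted ~28.5%, a gap small enough to come from rounding or a different base year in the source report.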
Two structural tailwinds compound on top of the headline numbers. First, modern robotics and multimodal systems require real-world egocentric and device-specific data that scraping cannot provide.[5] Second, public-data exhaustion drives a premium on curated, re-licensable corpora — and pushes more of the spend toward bespoke collection rather than pre-existing dumps.[2]
Competitive landscape
Two incumbents, two adjacents, one open lane.
Labeling/RLHF incumbents (Scale, Surge, Appen) dominate annotation. Rights-cleared marketplaces (Defined.ai, formerly DefinedCrowd) and consumer data apps (Kled, Sapien) are adjacent. The "rights-cleared raw multimodal data" lane is where Luel differentiates.
Luel differentiates on rights trail plus speed for bespoke multimodal collections — and on the catalog re-licensing economics that the labeling incumbents are not optimized to build.
Founder deep dive
A two-founder team that walked away from Berkeley to build the data layer.
Founders
Founder signal
Risks & mitigations
What we're watching
References
- [1] Y Combinator: Luel company profile
- [2] Epoch AI: Will we run out of data?
- [3] Reuters: New York Times sues OpenAI and Microsoft over copyright
- [4] The Verge: OpenAI reportedly used YouTube video transcriptions for training
- [5] Google Robotics: RT-1 Robotics Transformer (real-world data collection)
- [6] Latent Space: RLHF 201 (Nathan Lambert)
- [7] a16z Policy: AI, copyright, and fair use (submission)
- [8] Council of the EU: AI Act final approval (press release)
- [9] MarketsandMarkets: AI Training Dataset Market (press release)
- [10] Allied Market Research: Data Collection and Labelling Market
- [11] Reuters: U.S. Labor Department investigating Scale AI
- [12] Reuters: Scale AI raises $1B at ~$13.8B valuation
- [13] Reuters: Surge AI seeks up to $1B raise at >$15B valuation
- [14] TechCrunch: DefinedCrowd raises $50.5M Series B
- [15] TechCrunch: DefinedCrowd raises additional $15M
- [16] VentureBeat: Sapien raises $5M seed (Train2Earn)
- [17] Reuters: OpenAI signs deal with Stack Overflow
- [18] Reuters: Google reaches content licensing deal with Reddit
- [19] Axel Springer: Partnership with OpenAI (press)
- [20] arXiv: Nightshade: Prompt-specific poisoning attacks
- [21] MIT Technology Review: Artists use Nightshade to poison AI models


