Luel — Investment Memo · Orange Collective

Thesis

Public web data is exhausted, the EU AI Act's general-purpose AI obligations are now in force, and Meta's 49% stake in Scale AI has pushed frontier labs toward neutral data partners.^[2] ^[33] ^[24] Luel is building the rights-cleared default — a marketplace and collection engine that delivers bespoke multimodal datasets to frontier labs in days, then re-licenses the resulting catalog at high margins.^[1] The market has started to confirm the thesis: a $31.2M seed co-led by General Catalyst and Lightspeed in May 2026 — one of the largest seed rounds in YC history — with $2M ARR reached within weeks of demo day.^[22] ^[23]

01
Rights-cleared provenance is now law, not preference. Since August 2, 2025, GPAI providers serving the EU must publish training-data summaries on a mandatory Commission template and document copyright compliance.^[33] Combined with the NYT suit against OpenAI and the wave of publisher licensing deals (Stack Overflow, Reddit, Axel Springer), "where did this data come from" is a checkbox the buyer must clear before they sign.^[3] ^[17] ^[18] ^[19] ^[8] Luel's bundle — consent evidence, chain-of-title, QA logs — is the artifact that clears it.
02
Multimodal and robotics models need data the web cannot supply. Web-scale corpora are insufficient for state-of-the-art robotics, egocentric video, and specialist dialogue. Google's RT-1 work is a public example of the alternative: large, purpose-collected teleoperation datasets gathered by hand.^[5] Luel's collections go further — physics layers for embodied AI including sensor streams, device pose, and hand-object interaction — data that never existed on the web at all.^[23]
03
Marketplace plus re-licensable catalog economics compound. Owning non-exclusive rights to a collection lets Luel re-license it at high gross margins after collection costs are amortized, closer to stock-media dynamics but with AI-grade QA and metadata. The licensing comps put hard numbers on willingness to pay: Google pays Reddit a reported $60M a year, OpenAI's Reddit deal is estimated at ~$70M a year, and News Corp's OpenAI deal is worth up to $250M over five years.^[30] ^[32] ^[31] DefinedCrowd (now Defined.ai) raised $65M building a less rights-focused version of this thesis.^[14] ^[15]
04
The Meta–Scale deal opened a neutrality window. Frontier labs are few but spend heavily: Scale sits at a $29B valuation after Meta's $14.3B stake, Surge has been in talks at ≥$25B, and Mercor raised at $10B.^[24] ^[28] ^[29] But OpenAI and Google pulled work from Scale within days of the Meta announcement, putting Google's roughly $200M of planned 2025 spend up for grabs, and both Surge and Mercor are annotation-first.^[25] ^[26] The rights-cleared raw multimodal lane is still open, and the buyers are actively shopping for neutral partners.

Problem

The web has been scraped. Now labs need data the web never had.

Frontier AI labs need rights-cleared multimodal training data at scale, and the most accessible source — the public web — has been substantially exhausted for the modalities that matter most.^[2] Most available datasets fail production requirements: unclear rights, weak provenance, missing consent, inconsistent metadata.^[1] The shortcuts that worked five years ago no longer work.

At the same time, the cost of using shortcuts has gone up. The NYT v. OpenAI suit, the YouTube transcript controversy, and the EU AI Act have collectively pushed legal, brand, and procurement teams to demand explicit licensing and audit-ready documentation before a dataset can enter pre-training.^[3] ^[4] ^[8]

The remaining option is to commission real-world data: egocentric video from licensed contributors, professional dialogues with consent, robotics teleoperation footage shot to spec. That is operationally hard. It involves recruiting and vetting contributors, capturing consent, running multi-stage QA, and shipping artifacts the procurement team can sign off on. Most labs don't want to build the operation in-house, and the vendors that built it at scale have new problems of their own. After Meta took 49% of Scale AI in June 2025, neutrality itself became a procurement question: OpenAI wound down its Scale work and Google moved to cut ties, sending labs shopping for independent partners.^[24] ^[25] ^[26]

$29B

Scale AI valuation

Meta's 49% stake (Jun 2025) · neutrality now in question

≥$25B

Surge AI raise talks

Reported ~$1B round (Bloomberg, Jul 2025)

$10B

Mercor Series C

$350M round (Oct 2025) · ~$1.5M/day to contractors
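
As a quick sanity check on the Scale figures above, the valuation implied by Meta's stake can be recomputed from the two numbers the memo cites ($14.3B for 49%). This is a minimal sketch; the helper name `implied_valuation` is ours, and it assumes the full $14.3B bought exactly 49% of the company.

```python
def implied_valuation(stake_price_usd_b: float, stake_fraction: float) -> float:
    """Valuation (in $B) implied by paying `stake_price_usd_b` for `stake_fraction` of a company."""
    if not 0 < stake_fraction <= 1:
        raise ValueError("stake_fraction must be in (0, 1]")
    return stake_price_usd_b / stake_fraction


# Figures from the memo: Meta's $14.3B stake for 49% of Scale AI.
scale_valuation = implied_valuation(14.3, 0.49)
print(f"Implied Scale AI valuation: ${scale_valuation:.1f}B")
```

The result lands at roughly $29B, in line with the headline valuation quoted in the stat card.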

Why Now

Five forces hit at once. Rights-cleared multimodal data is the convergence trade.

Exhaustion of public corpora, regulatory hardening, the post-Meta neutrality shock, the rise of multimodal / robotics, and the maturation of contributor-marketplace operations all arrive in the same 24-month window.

Public data ran out and procurement teams stopped looking the other way.

Data exhaustion is no longer hypothetical. Epoch AI's projections put high-quality public text data on a finite curve, with constraints intensifying as training compute scales.^[2] Multimodal demand makes the gap larger, not smaller — real-world video, audio, and robotics footage was never on the web in usable quantities to begin with.

Lawsuits and licensing deals reshaped the buyer's risk model. The NYT v. OpenAI litigation, OpenAI's deal with Stack Overflow, Google's licensing arrangement with Reddit, and Axel Springer's partnership with OpenAI together signal that the era of "scrape now, apologize later" is closing.^[3] ^[17] ^[18] ^[19] Buyers now want a paper trail.

The EU AI Act's GPAI obligations are now in force. Final approval in May 2024 codified provenance and consent expectations; since August 2, 2025, GPAI providers must publish training-data summaries on a mandatory Commission template and maintain a copyright-compliance policy.^[8] ^[33] Provenance paperwork moved from best practice to legal requirement — and it effectively exports globally for any lab serving European customers.

Neutrality became a buying criterion. Meta's $14.3B stake in Scale (June 2025) triggered an immediate customer exodus: OpenAI phased out its Scale work, Google planned to walk from roughly $200M of 2025 spend, and Scale's competitors reported an influx of labs seeking neutral partners.^[24] ^[25] ^[26] Independent vendors are catching budgets that were locked up a year ago.

Multimodal and robotics need bespoke collection. State-of-the-art robotics systems like RT-1 rely on hand-collected teleoperation datasets, not scrapes.^[5] RLHF practice has converged on the conclusion that high-quality, diverse human data, not sheer volume, is the bottleneck.^[6]

How It Works

Spec in. Audit-ready dataset out. In days, not quarters.

Two product motions on one operational substrate.

Bespoke collection. The high-margin, high-touch motion: a frontier lab needs gemstone manufacturing footage, or patient-doctor dialogues, or egocentric data from cooks in a commercial kitchen. Luel turns the request around in days with the full paper trail. Customers pay for speed and for the procurement-ready artifacts.^[1]

Off-the-shelf catalog. The compounding motion: collections that were funded by a bespoke job become re-licensable inventory. The customer gets a fast start; Luel gets a margin profile closer to stock media than to services. The Ego-Realm egocentric dataset sample that co-founder Inigo Lenderking published on Hugging Face is an early demonstration of the catalog motion.

Compliance is the product, not a wrapper. Standard license templates, usage scopes, consent revocation flows, and documentation aligned to EU AI Act risk categories are not features bolted on — they are why the buyer chooses Luel.^[8]

Interoperability. APIs, SDKs, and delivery formats are designed to drop into the lab's existing pre-training, fine-tuning, and evals pipelines so the dataset doesn't sit in a slow review queue.

The artifacts that ship with every dataset

Paper trail

Consent evidence · Chain-of-title · License templates · QA logs · Provenance metadata · Usage scopes

The bundle that lets a frontier lab clear procurement, legal, and brand review without a multi-month back-and-forth.
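
To make the bundle concrete, here is a minimal sketch of how a buyer-side pipeline might check that a delivery carries all six artifacts before it enters review. Everything in it is hypothetical: the `DatasetBundle` class, the field names, and the file paths are illustrative assumptions, not Luel's actual schema or API.

```python
from dataclasses import dataclass, field

# Artifact names mirror the memo's list; the identifiers are illustrative only.
REQUIRED_ARTIFACTS = (
    "consent_evidence",
    "chain_of_title",
    "license_template",
    "qa_logs",
    "provenance_metadata",
    "usage_scopes",
)


@dataclass
class DatasetBundle:
    """Hypothetical delivery manifest: maps each artifact name to a path or URI."""
    dataset_id: str
    artifacts: dict[str, str] = field(default_factory=dict)

    def missing_artifacts(self) -> list[str]:
        # An artifact counts as missing if it is absent or mapped to an empty value.
        return [name for name in REQUIRED_ARTIFACTS if not self.artifacts.get(name)]

    def is_audit_ready(self) -> bool:
        return not self.missing_artifacts()


# Example: a partial delivery (paths are made up for illustration).
bundle = DatasetBundle(
    dataset_id="kitchen-egocentric-v1",
    artifacts={
        "consent_evidence": "consent/forms.zip",
        "qa_logs": "qa/run-01.jsonl",
    },
)
print(bundle.is_audit_ready())
print(bundle.missing_artifacts())
```

A gate like this is one way the "compliance is the product" claim could show up in practice: a dataset that fails the check never reaches the legal or procurement queue.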

Traction & Round

One of the largest seed rounds in YC history, weeks after demo day.

$31.2M

Seed round · May 2026

Co-led by General Catalyst + Lightspeed

1M+

Submissions through QA

40+ active dataset campaigns at any time

96

Countries in the network

500K+ vetted contributors

The round priced the wedge. The revenue arrived before the round did.

In May 2026, Luel announced a $31.2M seed co-led by General Catalyst and Lightspeed — one of the largest seed rounds in Y Combinator's history.^[22] ^[23] Additional backers include Paul Graham, SV Angel, Human Capital, and Orange Collective.^[22]

The traction preceded the capital: $2M ARR within roughly six weeks of demo day, over a million submissions processed through the QA pipeline, and 40+ dataset campaigns running concurrently.^[22] The customer base already spans generative AI labs, robotics companies, speech research teams, major social platforms, universities, hospitals, and banks — broader than the frontier-lab-only wedge we expected at memo time.^[22] ^[23]

Lightspeed's stated rationale is the "data wall": models exhausting public web data and requiring massive net-new, human-generated, rights-cleared data across modalities and geographies.^[23] That is the same thesis as this memo — now underwritten at institutional size.

Market

A market that is small today and inevitable tomorrow.

AI training datasets and the broader data collection & labelling services market are both compounding at roughly 28% annually, and multimodal is the fastest-growing segment.

The training dataset market is on track to more than triple between 2024 and 2029.

The AI training datasets market sits at roughly $2.82B in 2024 and is projected to reach ~$9.58B by 2029 at ~27.7% CAGR, with multimodal as the fastest-growing segment.^[9]

The broader data collection and labelling services market is roughly $3.0B in 2023 heading to ~$29.2B by 2032 at ~28.5% CAGR.^[10] Publisher and platform licensing deals show what buyers will pay at the top of the market: Google pays Reddit a reported $60M a year, OpenAI's Reddit deal is estimated at ~$70M a year, and News Corp's OpenAI agreement is worth up to $250M over five years.^[30] ^[32] ^[31]
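
The growth figures above can be cross-checked with the standard compound-growth formula, end = start × (1 + CAGR)^years. The sketch below uses only the numbers cited in this section; the function names are ours, and the year counts (5 for 2024 to 2029, 9 for 2023 to 2032) are read off the stated endpoints.

```python
def project(base: float, cagr: float, years: int) -> float:
    """Compound `base` forward at `cagr` for `years` years."""
    return base * (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Annual growth rate that takes `start` to `end` over `years` years."""
    return (end / start) ** (1 / years) - 1


# AI training datasets: $2.82B (2024) at ~27.7% CAGR for 5 years.
datasets_2029 = project(2.82, 0.277, 5)
datasets_multiple = datasets_2029 / 2.82

# Data collection and labelling services: $3.0B (2023) to ~$29.2B (2032).
services_cagr = implied_cagr(3.0, 29.2, 9)

print(f"Datasets market in 2029: ${datasets_2029:.2f}B ({datasets_multiple:.1f}x)")
print(f"Services market implied CAGR: {services_cagr:.1%}")
```

The datasets projection reproduces the ~$9.58B figure, a roughly 3.4x multiple. The services endpoints imply a growth rate slightly above the quoted ~28.5%, which is consistent with rounding in the start and end values.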

Two structural tailwinds compound on top of the headline numbers. First, modern robotics and multimodal systems require real-world egocentric and device-specific data that scraping cannot provide.^[5] Second, public-data exhaustion drives a premium on curated, re-licensable corpora — and pushes more of the spend toward bespoke collection rather than pre-existing dumps.^[2]

Competitive landscape

A $29B incumbent in a neutrality crisis, a $25B bootstrapper, and one open lane.

Labeling/RLHF incumbents (Scale, Surge, Mercor, Appen) dominate annotation. Rights-cleared marketplaces (Defined.ai) and consumer data apps (Kled, Sapien) are adjacent. The "rights-cleared raw multimodal data" lane is where Luel differentiates — and the Meta–Scale deal put the largest incumbent's neutrality in question.

Luel differentiates on rights trail plus speed for bespoke multimodal collections, on catalog re-licensing economics the labeling incumbents are not optimized to build — and on independence, at the exact moment labs are fleeing a Meta-owned incumbent.

— Luel's wedge^[1]

Founder deep dive

A two-founder team that walked away from Berkeley to build the data layer.

The pair. William and Inigo met as Berkeley roommates (and, reportedly, as competitive Fortnite players before that), then dropped out together after getting into YC, having circled the data space for two years.^[22] The pairing of a Berkeley M.E.T. dropout (William) with a Berkeley CS dropout (Inigo) rests on a deep shared history. The split is clean: William runs the company as CEO, Inigo runs ops as COO.

William's path to the problem. A USACO Platinum competitive programmer at 16, William founded computer-vision startup ezML and was a founding engineer at Relixir (YC X25), where he also ran GTM — shipping ML and data products at two early-stage companies before Luel. In parallel he co-authored an NDSS 2025 poster on LLM security and privacy at Northeastern's PEACH Lab — research exposure that maps directly onto the compliance and provenance surface Luel sells into. He also founded HackBlue, a cybersecurity hackathon, organizing students and practitioners around security and tooling work.

Inigo's path to the problem. Inigo is a former machine learning researcher focused on human behavioral modeling — technical familiarity that informs Luel's dataset product and QA practices.^[23] He maintains a Hugging Face account (Inigology) and a Luel organization presence there, and has already published the "Ego-Realm" egocentric dataset sample — demonstrating the rights-cleared, production-ready multimodal data that Luel sells. His pre-Berkeley education was at The King's School, Canterbury.

Why this team for this problem. The problem is half data engineering and half operations: sourcing contributors globally, capturing consent at the source, and shipping artifacts a Fortune 500 legal team will sign. William's founder and founding-engineer experience and security research background fit the technical and compliance side. Inigo's ML research background fits the contributor-recruitment and content-collection side. A 500K-person contributor network across 96 countries is an operations problem first, and the division of labor lines up cleanly with the two halves of the company.^[22]

On their YC partner. Luel's YC group partner is Harshita Arora — a signal of the partner team's read on the founders.

Founders

William Namgyal

Repeat Founder

Co-Founder & CEO

Berkeley M.E.T. dropout, USACO Platinum competitive programmer at 16, founder of computer-vision startup ezML, and founding engineer + GTM lead at Relixir (YC X25) before co-founding Luel. Co-authored an NDSS 2025 poster on LLM security and privacy as a research intern at Northeastern's PEACH Lab. Founded HackBlue, a cybersecurity hackathon. Now leads Luel as a compliance-forward marketplace and custom collection engine delivering licensed multimodal datasets to enterprise model-training teams.

Inigo Lenderking

Co-Founder & COO

Berkeley CS dropout and former machine learning researcher focused on human behavioral modeling. As COO and co-founder of Luel, runs the two-sided marketplace and collection engine — 500K+ vetted contributors across 96 countries — delivering licensed, audit-ready video and audio datasets to spec. Active on Hugging Face (Inigology) and has published the Ego-Realm egocentric dataset sample to demonstrate Luel's rights-cleared, production-ready multimodal data. Educated at The King's School, Canterbury before Berkeley.

Risks & mitigations

Risk

Incumbent response — a post-Meta Scale, Surge, Appen, or Defined.ai expands into rights-cleared catalog and outspends Luel on enterprise sales.

Mitigation

Win narrow, high-value niches first; press the neutrality advantage while OpenAI and Google budgets are in motion away from Scale; secure non-exclusive rights for catalog compounding; integrate tightly with lab pre-training pipelines so the switching cost grows with each delivery. The $31.2M seed funds the sales and ops build-out a two-person team previously couldn't field.

Risk

Copycat controversy — Kled founder Avi Patel publicly accused Luel of cloning his company's website and business model days after the seed announcement, in a viral X dispute that also included claims about Luel's compliance practices and user numbers.

Mitigation

Make the accusations testable: third-party compliance audits, transaction-level consent records, and named enterprise references that a consumer data app cannot match. Replace surface-level design similarities quickly and let delivery speed, auditable provenance, and the customer list carry the differentiation. Reputational risk with frontier-lab buyers is the real exposure — enterprise diligence, not social media, is where it must be answered.

Additional integrity surface. Data poisoning and extraction risks against open-web sources continue to elevate the value of curated, traceable provenance — Nightshade and related work demonstrate why adversarial checks and closed-loop collection matter.^[20] ^[21] Luel's closed-loop collection model is a structural answer to that surface, but adversarial integrity remains an ongoing engineering investment, not a solved problem.

What we're watching

ARR trajectory past the $2M post-demo-day mark — and how much of the $31.2M goes to contributor liquidity versus QA automation and enterprise sales.
First named frontier-lab logo — explicit references to delivered, in-production datasets with measurable model-quality lift.
Conversion of the post-Meta neutrality window — the OpenAI and Google budgets that left Scale are being re-allocated now; Luel needs logo wins before Surge and Mercor absorb them.
Catalog SKU count and re-license velocity — the inventory side of the business is where margins compound once collection costs are amortized.
EU AI Act GPAI enforcement through 2026 — training-data summaries and copyright-policy obligations are live; every enforcement action is direct tailwind.
Resolution of the Kled dispute — whether the copycat and compliance accusations fade or surface in enterprise diligence.

References