Inception, a new Palo Alto-based company started by Stanford computer science professor Stefano Ermon, claims to have developed a novel AI model based on “diffusion” technology. Inception calls it a diffusion-based large language model, or a “DLM” for short.
The generative AI models receiving the most attention now can be broadly divided into two types: large language models (LLMs) and diffusion models. LLMs, built on the transformer architecture, are used for text generation. Meanwhile, diffusion models, which power AI systems like Midjourney and OpenAI’s Sora, are mainly used to create images, video, and audio.
Inception’s model offers the capabilities of traditional LLMs, including code generation and question-answering, but with significantly faster performance and reduced computing costs, according to the company.
Ermon told TechCrunch that he has been studying how to apply diffusion models to text for a long time in his Stanford lab. His research was based on the idea that traditional LLMs are relatively slow compared to diffusion technology.
With LLMs, “you cannot generate the second word until you’ve generated the first one, and you cannot generate the third one until you generate the first two,” Ermon said.
Ermon wanted to apply a diffusion approach to text because, unlike LLMs, which work sequentially, diffusion models start with a rough estimate of the data they’re generating (e.g., a picture) and then bring the data into focus all at once.
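To make the contrast concrete, here is a toy sketch in Python. It is not Inception's method and involves no real model: the "model" is a stand-in that already knows the target sentence, and the function names, the `MASK` placeholder, and the `num_steps` parameter are all assumptions made for illustration. It only counts how many sequential steps each style of generation would need.

```python
MASK = "_"  # hypothetical placeholder for a position not yet filled in


def autoregressive_generate(target):
    """Emit one token per sequential step, left to right."""
    output, steps = [], 0
    for token in target:
        # Stand-in for one model call: token i waits on tokens 0..i-1.
        output.append(token)
        steps += 1
    return output, steps


def parallel_refine_generate(target, num_steps):
    """Start fully masked, then fill in many positions at every step."""
    n = len(target)
    output = [MASK] * n
    steps = 0
    for step in range(num_steps):
        # Stand-in for one model call that updates a whole set of
        # positions at once (here: every num_steps-th position).
        for i in range(step, n, num_steps):
            output[i] = target[i]
        steps += 1
    return output, steps


target = "diffusion models refine the whole sequence at once".split()
ar_out, ar_steps = autoregressive_generate(target)
dr_out, dr_steps = parallel_refine_generate(target, num_steps=2)
print("sequential steps, one token at a time:", ar_steps)
print("sequential steps, parallel refinement:", dr_steps)
```

In this toy, the token-by-token loop needs as many sequential steps as there are tokens, while the parallel version needs only as many steps as it has refinement passes. Whether a real diffusion model produces good text in few passes is the hard part, and this sketch says nothing about that.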
Ermon hypothesized that generating and modifying large blocks of text in parallel was possible with diffusion models. After years of trying, Ermon and a student of his achieved a major breakthrough, which they detailed in a research paper published last year.
Recognizing the advancement’s potential, Ermon founded Inception last summer, tapping two former students, UCLA professor Aditya Grover and Cornell professor Volodymyr Kuleshov, to co-lead the company.
While Ermon declined to discuss Inception’s funding, TechCrunch understands that the Mayfield Fund has invested.
Inception has already secured several customers, including unnamed Fortune 100 companies, by addressing their critical need for reduced AI latency and increased speed, Ermon said.
“What we found is that our models can leverage the GPUs much more efficiently,” Ermon said, referring to the computer chips commonly used to run models in production. “I think this is really, like, a big deal, because I think this is going to change the way people build language models.”
Inception offers an API as well as on-premises and edge device deployment options, support for model fine-tuning, and a suite of out-of-the-box DLMs for various use cases. The company claims its DLMs can run up to 10x faster than traditional LLMs while costing 10x less.
“Our ‘small’ coding model is as good as [OpenAI’s] GPT-4o mini while more than 10 times as fast,” a company spokesperson told TechCrunch. “Our ‘mini’ model outperforms small open-source models like [Meta’s] Llama 3.1 8B and achieves more than 1,000 tokens per second.”
“Tokens” is industry parlance for bits of raw data. One thousand tokens per second is an impressive speed indeed, assuming Inception’s claims hold up.
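For a rough sense of what that throughput would mean in practice, here is a back-of-the-envelope calculation in Python. The 500-token response length and the 100-tokens-per-second comparison point are hypothetical figures chosen for illustration; only the 1,000-tokens-per-second number comes from the company's claim.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


response_tokens = 500      # hypothetical response length
claimed_rate = 1000        # tokens per second, per Inception's claim
comparison_rate = 100      # hypothetical rate, 10x slower, for illustration

fast = generation_seconds(response_tokens, claimed_rate)
slow = generation_seconds(response_tokens, comparison_rate)
print(f"at {claimed_rate} tokens/s: {fast} s")
print(f"at {comparison_rate} tokens/s: {slow} s")
```

Under these assumptions, the same response would take half a second instead of five seconds.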