Chinese Startup Unveils AI Video Software to Rival OpenAI’s Sora

Shengshu-AI claims its Vidu software can generate high-quality videos lasting up to 16 seconds, far surpassing previous Chinese text-to-video models.

By Ye Zhanhang

Apr 29, 2024

A Chinese startup has unveiled an artificial intelligence-powered system capable of generating high-definition videos lasting up to 16 seconds, marking a major breakthrough for China’s AI industry as it races to catch up with the United States’ leading firms.

Shengshu-AI, a Beijing-based startup that was founded only last year, presented the new system — which it has named Vidu — at the Zhongguancun Forum in Beijing on Saturday, describing it as China’s “first long-duration, high-consistency, and highly dynamic video generation model.”

Many in China have been quick to dub Vidu China’s answer to Sora, the text-to-video model created by OpenAI that sent shockwaves around the world when it was unveiled in February.

For now, it appears that Vidu is still some way from matching Sora’s capabilities. According to Shengshu-AI, Vidu can generate high-definition videos lasting up to 16 seconds, whereas Sora can generate 60-second clips.

But this would still put Vidu at the very cutting edge of the rapidly evolving AI-generated content field. Most of the leading text-to-video models, including Pika and Gen-2, only produce clips lasting up to 4 seconds.

Unlike those models, Vidu is not yet publicly available, and Shengshu-AI has yet to confirm when it will be formally launched. But the company performed a live demonstration of the system at the forum and said it was open to working with partners to further fine-tune its technology.

Shengshu-AI is one of many startups to have emerged during the frenzy of AI-related investment in China since the release of OpenAI’s ChatGPT in late 2022.

The firm was founded in March 2023 with Zhu Jun, a leading AI researcher at Beijing’s prestigious Tsinghua University, joining as chief scientist. It has since raised over 100 million yuan ($14 million) from investors, including the Chinese tech giants Ant Group and Baidu.

At the Zhongguancun Forum, Zhu said that Vidu was capable of generating scenes that are consistent with the laws of physics and contain rich details, such as realistic shadow effects and facial expressions.

In another nod to Shengshu-AI’s ambitions to rival OpenAI, the live demonstration of Vidu that followed featured a video almost identical to the one used to launch Sora — a clip of a car driving along a mountain road.

The primary technology underpinning Vidu is the Universal Vision Transformer (U-ViT), which combines two types of AI model: the Transformer and the diffusion model. It is similar to the Diffusion Transformer (DiT) architecture behind Sora, but Shengshu-AI claims that its research team developed its system before OpenAI, releasing a related paper in September 2022.

“After Sora’s release in February, we found that our technical roadmaps are highly aligned, and we became even more determined to press forward with our own research,” Zhu said at the forum.

The release of Sora earlier this year astonished many in China, as the technical challenges involved in generating AI video far surpass those involved in creating text and still images. The hashtag “Sora” received over 100 million views on the Chinese microblogging platform Weibo within a week of the product’s launch.

Within China’s AI industry, there were fears that the launch of Sora showed that the gap between Silicon Valley and China was widening. But Shengshu-AI has been bullish about its ability to catch up with the U.S.’s market leaders.

As recently as February, Vidu was reportedly only capable of generating 4-second clips, but that has increased fourfold in just a few months. In March, Shengshu-AI’s CEO, Tang Jiayu, told domestic media: “It’s certain that the model can reach Sora’s level this year, though it’s difficult to say whether it will take three months or six months.”

With its demonstration of Vidu, Shengshu-AI has proved itself a leader in China’s AI sector, Chen Chen, a partner at consultancy Analysys, told domestic media. Yet Sora remains far ahead in terms of the duration, diversity, and richness of its videos, Chen added.

China’s tech industry continues to invest heavily in AI content generation. Major AI models including ChatGPT, Stable Diffusion, and Midjourney are unavailable in China, leaving a large hole in the market for domestic firms to fill.

In recent months, major tech firms including ByteDance, Kuaishou, Tencent, and SenseTime, as well as a host of smaller players, have reported progress in developing text-to-video AI tools. However, several have stressed that their products remain in their infancy.

According to market research firm iResearch, the value of China’s AI-generated content market is predicted to grow at 87% annually for the remainder of the decade, twice the speed of the global market.

(Header image: Shengshu Technology and Tsinghua University launch Vidu, a text-to-video model, at the 2024 Zhongguancun Forum in Beijing, April 27, 2024. CNS)