BEIJING — Alibaba Cloud has led a $290 million investment in ShengShu, a three-year-old startup developing AI models that aim to simulate the physical world more accurately. The funding round, announced on Friday, underscores a growing recognition among tech giants of the limitations of large language models such as those behind OpenAI's ChatGPT, which rely heavily on text data. ShengShu is instead focusing on what it calls a "general world model," designed to integrate multimodal data including vision, audio, and touch to better capture real-world dynamics.
The Series B round, equivalent to 2 billion yuan, was led by Alibaba Cloud, with participation from TAL Education and Baidu Ventures. It comes just two months after ShengShu secured 600 million yuan from investors including Qiming Venture Partners. The startup, which powers the AI video generation tool Vidu, declined to disclose its current valuation but said the new capital will accelerate the development of its world model technology. This approach seeks to bridge the gap between digital realms, such as games and AI-generated videos, and physical applications like autonomous driving and robotics.
"ShengShu believes that a general world model, built on multimodal data such as vision, audio, and touch, more naturally captures how the physical world works than large language models," the company said in a statement released Friday. The model is intended to enable AI systems to connect perception with action, allowing for more consistent prediction and modeling of real-world behaviors. Zhu Jun, founder of ShengShu, elaborated in the same statement: "We aim to connect perception and action," highlighting the goal of creating AI that can interact more intuitively with its environment.
ShengShu's Vidu platform has already gained traction in the competitive AI video generation space. Its latest iteration, Vidu Q3 Pro, released in January, ranks among the top 10 models for generating videos from text and images, according to Artificial Analysis, an independent benchmarking firm. The company launched Vidu globally well before OpenAI made its Sora tool widely available—though Sora was later shuttered. In China, rivals like Kuaishou and ByteDance have also rolled out similar AI video tools, intensifying competition in the sector.
Alibaba's move aligns with its broader strategy of investing in AI technologies that extend beyond its e-commerce roots. Just last month, Alibaba and Baidu Ventures co-led a $50 million investment in Tripo AI, a platform that uses AI to generate 3D models from photographs. Tripo AI is similarly shifting away from language model techniques toward tools grounded in physical space, and it has announced plans to develop its own world model. In September, Alibaba led a $60 million round for PixVerse, which earlier this year released an AI world model that lets users guide video generation in real time.
Alibaba itself has been active in open-source AI development. The company has released free models for video generation and, in February, launched one specifically for powering robots. These efforts position Alibaba as a key player in China's push to advance AI capabilities, particularly in areas with practical applications. ShengShu emphasized on Friday that it has forged strategic partnerships with firms working on embodied AI—systems like humanoid robots designed for industrial, commercial, and home use.
The emphasis on world models reflects a broader industry pivot as developers confront the shortcomings of large language models, or LLMs, which excel at processing text but struggle with spatial reasoning and physical interactions. Experts argue that for AI to achieve human-like intelligence, it must incorporate an understanding of the physical world alongside reasoning and continuous learning. Kevin Kelly, co-founder of Wired magazine, wrote last month on his Substack that world models are essential for robotics, noting that such technology requires more than LLMs to function effectively.
"Ultimately, to replicate human intelligence, AI will need three things: reasoning, an understanding of the physical world and continuous learning," Kelly wrote. He pointed out that while LLM-powered chatbots have advanced reasoning, understanding of the physical world remains a critical gap, making world models a priority for breakthroughs. Kelly's commentary, published in March, aligns with the trend seen in recent investments like Alibaba's in ShengShu.
In the context of global AI development, China's tech ecosystem is rapidly expanding in multimodal AI. ShengShu's Vidu, for instance, has been praised for its ability to produce high-quality videos that incorporate realistic physics and movements, setting it apart from purely text-based systems. The startup's Beijing headquarters serve as a hub for this innovation, drawing talent from across the country's vibrant AI research community.
However, challenges persist. Developing world models demands vast amounts of diverse data, including video footage and sensor inputs, which raises concerns about privacy and computational resources. Alibaba Cloud, with its extensive infrastructure, could provide a backbone for scaling these efforts. The investment also highlights tensions in the U.S.-China tech rivalry: American firms like OpenAI lead in LLMs, while U.S. export restrictions on advanced chips limit Chinese developers' access to computing power.
Looking ahead, ShengShu plans to apply its world model to enhance robot autonomy, potentially revolutionizing sectors like manufacturing and healthcare. Partnerships with embodied AI developers could lead to prototypes within the next year, according to industry observers. Alibaba's CEO, Eddie Wu, has previously stressed the importance of AI in driving the company's growth, and this investment fits into that vision.
The funding round was announced amid a flurry of AI advancements in Asia. Baidu, whose venture arm took part in the round, has its own Ernie Bot but is increasingly focusing on multimodal integrations. TAL Education, known for edtech, sees potential in AI for personalized learning environments that simulate real-world scenarios.
As AI evolves, the shift toward world models could democratize advanced robotics, making them more accessible for everyday use. Yet, ethical questions about AI's role in physical spaces—such as job displacement in automation-heavy industries—loom large. Regulators in China and beyond are watching closely, with calls for frameworks to ensure safe deployment.
For now, Alibaba's bet on ShengShu signals confidence in a future where AI doesn't just converse but comprehends and acts in the tangible world. With $290 million fueling this ambition, the startup is poised to push boundaries, potentially reshaping how we interact with intelligent machines.
This development comes at a time when global AI investment hit record highs in 2025, with Asia accounting for nearly 40% of venture funding, per PitchBook data. ShengShu's rapid growth, from seed stage to multiple funding rounds in about three years, exemplifies the pace of innovation in the field.
