HTML Is Not Dying Tomorrow, but the Web’s Center of Gravity Is Shifting from the DOM to the Model
A sourced analysis of Flipbook, LTX video models, web standards, bandwidth economics, and what model-native interfaces really mean for HTML.
A recent WeChat article framed Flipbook as a direct assault on HTML: no layout engine, no component tree, no hand-authored interface code—just a live stream of pixels generated by a model. That headline is obviously optimized for shock. Still, after checking Flipbook’s own site, Zain Shah’s public post on X, and the public material around Lightricks’ LTX video models, the underlying idea deserves serious attention.
What Flipbook proposes is not “AI inside a web page.” It is closer to “AI as the page.” Instead of shipping structure first and rendering it later, the system generates the visible interface itself as an image—or, experimentally, as a continuous video stream.
What the source article got right
Three claims from the WeChat article hold up against public sources. First, Flipbook defines each page as an image rather than a DOM tree. Its website says this plainly: every page you land on is an image, and clicking anywhere generates a new image that explores that part in more depth. Second, the text is also rendered by the image model as pixels rather than overlaid through traditional HTML text elements. Third, the system is not detached from the web underneath; Flipbook says the information on screen comes from a combination of agentic web search and the model’s own world knowledge.
That combination matters. Flipbook is not merely a visual skin on top of static content. It is a stack that recombines search, interpretation, layout, rendering, and interaction into a single model-mediated loop.
How this likely works under the hood
Public sources do not disclose the entire internal architecture, so the right way to describe the stack is to separate what is confirmed from what can be responsibly inferred.
Confirmed: Flipbook currently has a static image generation system and an experimental live video stream mode. The site says the video mode animates generated images and creates seamless transitions between them. It also says the feature is resource-intensive and currently combines two separate systems: a custom, highly optimized video generation model and Flipbook’s image generation system.
Confirmed model context: Lightricks’ public LTX materials explain why this kind of experience is now technically plausible. The LTX-Video model card describes it as a DiT-based real-time video generation model capable of producing 30 FPS video at 1216×704 faster than it can be watched. The arXiv paper for LTX-Video describes a Video-VAE with a 1:192 compression ratio and full spatiotemporal self-attention in latent space. Lightricks’ newer LTX-2 line pushes further into synchronized audio-video generation, multiple performance modes, multiscale pipelines, and native 4K output.
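To make the 1:192 figure concrete, here is a rough sketch of where such a ratio can come from. The patch size (32×32 pixels over 8 frames per latent token) and the 128 latent channels are figures taken from a reading of the LTX-Video paper, not from the article above, so treat them as assumptions to verify against the paper. The function names are illustrative.

```python
# Assumed figures from the LTX-Video paper (not stated in this article):
# each latent token covers a 32x32-pixel patch across 8 frames and carries
# 128 channels. Verify against the paper before relying on them.
PATCH_HEIGHT = 32
PATCH_WIDTH = 32
PATCH_FRAMES = 8
RGB_CHANNELS = 3
LATENT_CHANNELS = 128


def compression_ratio() -> float:
    """Raw pixel values per latent token divided by values stored per token."""
    pixel_values = PATCH_HEIGHT * PATCH_WIDTH * PATCH_FRAMES * RGB_CHANNELS
    return pixel_values / LATENT_CHANNELS


def tokens_per_latent_frame(width: int, height: int) -> int:
    """Spatial token count for one latent frame (covers 8 video frames here)."""
    return (width // PATCH_WIDTH) * (height // PATCH_HEIGHT)


print(compression_ratio())                 # 192.0
print(tokens_per_latent_frame(1216, 704))  # 836
```

Under these assumptions, a 1216×704 clip needs only a few hundred tokens per group of eight frames, which helps explain why full spatiotemporal attention in latent space can be affordable.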
Reasonable inference: A system like Flipbook likely runs through six layers.
1. Retrieval and state management. A user prompt or a click on a region of the image has to be translated into intent. “Zoom into this chart bar,” “explain this landmark,” and “follow this branch of the diagram” are all different operations. That requires memory of session history, page state, and prior clicks.
2. Semantic planning. The system then needs a planner that decides how to explain the next state visually. This is where traditional UI concerns—layout, hierarchy, composition, emphasis—are no longer handled by a CSS layout engine alone. They are chosen dynamically by a model or a model-directed planning layer.
3. Image generation. The next screen is synthesized as a visual composition. Because text is baked into the image, the output can mix diagrams, photography, magazine-style composition, labels, and motion cues in one frame. The trade-off is equally clear: text precision, accessibility, copyability, deterministic layout, and responsive semantics become harder than in HTML.
4. Interaction mapping. If users click pixels instead of buttons, the system needs some form of semantic hit-testing. That does not require literal HTML hotspots, but it almost certainly requires a mapping between screen coordinates, generated objects, object identities, and the current semantic state of the session.
5. Video interpolation or continuous generation. Once the system moves from isolated images to a live stream, a video model can animate transitions and preserve continuity across steps. This is where static visual explanation starts turning into a low-latency interface runtime.
6. Real-time infrastructure. A model-generated interface has to send frames down to the client and return user events upstream with low latency. The WeChat article mentions WebSockets and Modal GPU infrastructure; that fits the broader technical pattern. Traditional web apps push rendering work toward the browser. A Flipbook-like stack recenters that work on the server.
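The interaction-mapping layer (item 4) can be illustrated with a minimal sketch. Flipbook has not published how it resolves clicks, so everything here is hypothetical: the `Region` type, the idea that the generator emits labeled rectangles alongside each frame, and the rule that the smallest enclosing region wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    """Hypothetical labeled rectangle in a generated frame (pixel coordinates)."""
    object_id: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # Half-open bounds: the right and bottom edges are excluded.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)

    @property
    def area(self) -> int:
        return self.width * self.height


def hit_test(regions: list[Region], px: int, py: int) -> Optional[str]:
    """Return the id of the smallest region containing the click, or None."""
    hits = [r for r in regions if r.contains(px, py)]
    if not hits:
        return None
    return min(hits, key=lambda r: r.area).object_id


# Illustrative frame: a chart with one bar nested inside it.
frame_regions = [
    Region("chart", 100, 100, 600, 400),
    Region("chart/bar-3", 320, 220, 60, 280),
]
print(hit_test(frame_regions, 340, 300))  # chart/bar-3
print(hit_test(frame_regions, 150, 120))  # chart
print(hit_test(frame_regions, 10, 10))    # None
```

A real system might instead send the click coordinates and the frame back to a vision-capable model and let it infer the target. The rectangle map is simply the cheapest structure that makes the coordinate-to-object mapping explicit.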
Why the idea is genuinely important
The strongest case for this approach is not that it removes HTML. It is that it improves expression for categories of tasks where ordinary interfaces are clumsy. Explaining a benchmark chart, a travel plan, a biological process, or a product workflow often works better as a visual narrative than as cards, tabs, forms, and prose blocks. Flipbook treats the interface less like a document and more like an adaptive explainer.
That makes the model-native approach especially compelling for education, product demos, interactive storytelling, research overviews, early-stage prototyping, and lightweight creativity tools. In those domains, the interface does not need to be a stable enterprise shell first. It needs to be a good guide.
Why HTML is not going away soon
If the question is whether more interfaces will be generated without HTML as the primary expression layer, the answer is yes. If the question is whether HTML itself is about to be displaced across the web, the evidence says no.
First, HTML is not just a visual syntax. It is part of the web’s semantic, accessible, linkable, indexable, copyable, cacheable, and auditable substrate. WHATWG still describes itself as “Maintaining and evolving HTML since 2004.” That alone is a reminder that HTML remains live infrastructure, not a fossil waiting to be swept away.
Second, the economics are radically different. HTTP Archive’s 2024 Web Almanac reports a median page weight of 2,652 KB on desktop and 2,311 KB on mobile in October 2024. By contrast, YouTube’s recommended upload encoding settings list 1080p standard-frame-rate video at 8 Mbps, which works out to 1 MB per second. In other words, about 2.3 to 2.6 seconds of sustained 1080p video transfer is already in the same ballpark as the entire transfer weight of today’s median web page. The page is transferred once and can be cached; the stream costs that much again every few seconds for as long as the session lasts. Flipbook itself acknowledges that live video streaming is resource-intensive.
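The comparison is easy to check. The sketch below uses the figures quoted above and assumes decimal units (1 KB = 1,000 bytes, 1 Mbps = 1,000,000 bits per second); the function name is illustrative.

```python
# Figures quoted in the text; decimal units are an assumption.
MEDIAN_PAGE_KB = {"desktop": 2652, "mobile": 2311}
VIDEO_BITRATE_MBPS = 8


def seconds_of_video(page_kb: float, bitrate_mbps: float) -> float:
    """Seconds of video at the given bitrate that weigh as much as one page."""
    bytes_per_second = bitrate_mbps * 1_000_000 / 8
    return page_kb * 1000 / bytes_per_second


for device, page_kb in MEDIAN_PAGE_KB.items():
    seconds = seconds_of_video(page_kb, VIDEO_BITRATE_MBPS)
    print(f"{device}: {seconds:.2f} s")
# desktop: 2.65 s
# mobile: 2.31 s
```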
Third, server-side inference still carries a real marginal cost. Modal’s public pricing page lists H100 at $0.001097 per second, or about $3.95 per hour. Real production systems will multiplex, batch, and optimize around that, but the underlying point stands: a pixel-stream interface consumes ongoing server-side inference resources in a way a normal HTML interface usually does not.
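A back-of-the-envelope sketch shows how the per-second price scales. The rate is the one quoted above; the ten-minute session length and the `sessions_per_gpu` sharing factor are hypothetical inputs chosen for illustration, not measurements of any real deployment.

```python
H100_USD_PER_SECOND = 0.001097  # rate quoted in the text


def gpu_cost_usd(seconds: float, sessions_per_gpu: int = 1) -> float:
    """GPU cost attributed to one session, split evenly across shared sessions."""
    if sessions_per_gpu < 1:
        raise ValueError("sessions_per_gpu must be at least 1")
    return H100_USD_PER_SECOND * seconds / sessions_per_gpu


print(f"{gpu_cost_usd(3600):.2f}")                    # 3.95 (one GPU-hour)
print(f"{gpu_cost_usd(600):.2f}")                     # 0.66 (ten minutes, dedicated)
print(f"{gpu_cost_usd(600, sessions_per_gpu=4):.3f}") # 0.165 (ten minutes, four-way share)
```

Even with generous sharing, the cost recurs for every minute of every session, whereas a static HTML page is served once and then cached.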
What is more likely to happen
The most plausible future is not total replacement. It is stratification.
Layer one will remain HTML-heavy. Transactions, forms, enterprise software, documentation, search, admin panels, and regulated workflows still benefit from deterministic structure, accessibility, inspectability, and low cost.
Layer two will be model-augmented HTML. The substrate stays structured, but a model handles summarization, re-layout, explanation, navigation, and personalization on top.
Layer three will be Flipbook-like model-native surfaces. These will be strongest where exploration, presentation, and synthesis matter more than rigid workflows: learning, sales demos, creative tools, visual search, interactive explainers, and early design.
That means HTML may become less visible even if it remains essential. Users may increasingly interact with a model-generated presentation layer first, while structured web infrastructure continues to do the durable work underneath.
Conclusion
Flipbook’s real provocation is not that it removes HTML from the screen. It is that it reverses the default order of interface design. The old web defines structure first and expression second. Model-native systems attempt expression first and derive interaction from it in real time.
That is a meaningful shift. But for it to become a dominant computing paradigm, three problems still have to be solved at once: factual reliability, accessibility, and cost. Until then, HTML is unlikely to disappear. What is more likely is that it retreats downward—less visible to users, still indispensable to the web.
Sources
- Original WeChat article: 51CTO技术栈
- Flipbook: flipbook.page
- Zain Shah on X: Flipbook announcement
- Lightricks LTX-Video GitHub: github.com/Lightricks/LTX-Video
- LTX-Video paper: arXiv:2501.00103
- LTX documentation: docs.ltx.video
- WHATWG: Maintaining and evolving HTML since 2004
- HTTP Archive Web Almanac 2024: Page Weight
- YouTube recommended upload encoding settings: 1080p standard-frame-rate bitrate guidance
- Modal Pricing: modal.com/pricing