Race to Build Vision-Language AI Heats Up in Korea Amid Global Multimodal Boom


A visually impaired person hailing a taxi using GPT-4o (Screenshot from OpenAI’s YouTube channel)

SEOUL, May 19 (Korea Bizwire) — As tech giants worldwide accelerate development of vision-language models (VLMs)—AI systems capable of understanding images, text, and speech simultaneously—South Korean firms are joining the global race with new multimodal technologies of their own.

The technology gained public attention when OpenAI demonstrated its GPT-4o model last year with a video showing the system helping a visually impaired person hail a taxi. By interpreting real-time video and responding via voice, the AI model highlighted the powerful potential of multimodal learning, in which visual and linguistic data are processed together to understand context and environment.

VLMs are already being applied across sectors such as healthcare, e-commerce, education, and tourism—generating marketing content from store images, assisting doctors in analyzing X-rays, and producing tour descriptions from photos.

However, the rapid evolution of this technology has also raised concerns. OpenAI faced backlash after GPT-4o’s voice was found to closely resemble that of actress Scarlett Johansson, prompting the company to suspend the feature. Experts warn that VLMs trained on mixed data sets could be exploited for identity inference, voice mimicry, and fake content creation.

Despite the risks, South Korean companies are pressing forward.

On May 16, Naver’s HyperCLOVA X SEED 3B, a lightweight open-source VLM, surpassed 120,000 downloads on Hugging Face, a leading global developer platform. It is Naver’s first generative AI released to the open-source community and is capable of processing text, images, and video. Designed for Korean language context, the model supports chart analysis, object recognition, and image-based Q&A.

A scene from the film Her (Screenshot from Universal Pictures’ YouTube channel)

“We’re seeing strong feedback from the open-source community, especially because the model is optimized for Korean,” a Naver representative said.

Meanwhile, Kakao introduced two new models this month: Kanana-a, which understands audio and text, and Kanana-o, which integrates visual and audio inputs. Kakao claims Kanana-o performs at a level comparable to top global models in English and outperforms them in Korean.

AI startup Twelve Labs plans to launch its video-focused multimodal models, Marengo and Pegasus, on Amazon Bedrock, marking the first such deployment for a Korean AI firm on the platform.

Game developer NCSoft has also released VARCO Vision, a lightweight open-source VLM optimized for Korean-language tasks.

According to Koh Sam-seok, a professor at Dongguk University’s College of AI Convergence, VLMs are becoming a defining trend in global AI development. “Major players like Naver and Samsung must push forward with proprietary large language models, while SMEs should build services by adapting open-source models,” he said.

As AI becomes more immersive and multimodal, the next generation of Korean tech companies is preparing not only to keep pace—but to lead.

Kevin Lee (kevinlee@koreabizwire.com) 
