Meet CMMMU: A New Chinese Massive Multi-Discipline Multimodal Understanding Benchmark Designed to Evaluate Large Multimodal Models (LMMs)

In the realm of artificial intelligence, Large Multimodal Models (LMMs) have exhibited remarkable problem-solving capabilities across diverse tasks, such as zero-shot image/video classification, zero-shot image/video-text retrieval, and multimodal question answering (QA). However, recent studies highlight a substantial gap between powerful LMMs and expert-level artificial intelligence, particularly in tasks involving complex perception and reasoning with domain-specific…

DeepSeek-AI Introduces the DeepSeek-Coder Series: A Range of Open-Source Code Models from 1.3B to 33B, Trained from Scratch on 2T Tokens

In the dynamic field of software development, integrating large language models (LLMs) has initiated a new chapter, especially in code intelligence. These sophisticated models have been pivotal in automating various aspects of programming, from identifying bugs to generating code, revolutionizing how coding tasks are approached and executed. The impact of these models is vast, offering…
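
For readers who want to try the series, the sketch below shows one way to prompt an open-source checkpoint through Hugging Face transformers; the checkpoint ID, prompt, and generation settings are assumptions for illustration, not details taken from the article.

```python
# Minimal sketch (assumed checkpoint ID): code completion with DeepSeek-Coder
# via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # assumption: smallest of the series
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Base code models are trained for completion, so a partial definition works as a prompt.
prompt = "# return True if n is prime\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```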

This AI Paper from China Introduces ‘AGENTBOARD’: An Open-Source Evaluation Framework Tailored to Analytical Evaluation of Multi-Turn LLM Agents

Evaluating LLMs as versatile agents is crucial for their integration into practical applications. However, existing evaluation frameworks face challenges in benchmarking diverse scenarios, maintaining partially observable environments, and capturing multi-round interactions. Current assessments often reduce performance to a single final success rate, providing limited insight into the processes behind agent behavior. The complexity of agent tasks, involving…
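
To make the metric critique concrete, here is a hypothetical sketch contrasting a binary final success rate with a fine-grained progress metric over per-subgoal logs; the log format and the `progress_rate` definition are illustrative assumptions, not AGENTBOARD's actual scoring code.

```python
# Toy comparison of a binary success metric vs. a partial-credit progress metric.

def final_success(run: list) -> float:
    """1.0 only if every subgoal in the run was completed (the usual binary metric)."""
    return 1.0 if all(run) else 0.0

def progress_rate(run: list) -> float:
    """Fraction of subgoals completed, giving partial credit for intermediate progress."""
    return sum(run) / len(run)

# Each run lists per-subgoal outcomes from one multi-turn interaction.
runs = [
    [True, True, True],     # fully solved
    [True, True, False],    # failed only at the final step
    [False, False, False],  # made no progress at all
]

for run in runs:
    print(final_success(run), round(progress_rate(run), 2))
# Final success scores both failing runs identically (0.0); the progress
# metric separates "almost solved" (0.67) from "no progress" (0.0).
```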

Researchers from the Chinese University of Hong Kong and Tencent AI Lab Propose a Multimodal Pathway to Improve Transformers with Irrelevant Data from Other Modalities

Transformers have found widespread application in diverse tasks spanning text classification, map construction, object detection, point cloud analysis, and audio spectrogram recognition. Their versatility extends to multimodal tasks, exemplified by CLIP’s use of image-text pairs for superior image recognition. This underscores transformers’ efficacy in establishing universal sequence-to-sequence modeling, creating embeddings that unify data representation across…
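
As a rough illustration of the proposal, here is a minimal PyTorch sketch of an auxiliary cross-modal path: a frozen linear layer borrowed from a transformer trained on a different modality is blended into a target layer through a learnable scalar. The class name, zero-initialized scale, and blending rule are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CrossModalLinear(nn.Module):
    """Target linear layer augmented with a frozen layer from another modality."""
    def __init__(self, target: nn.Linear, auxiliary: nn.Linear):
        super().__init__()
        self.target = target                  # trained on the target modality
        self.auxiliary = auxiliary            # borrowed, e.g., from a text model
        self.auxiliary.requires_grad_(False)  # keep the borrowed weights frozen
        self.scale = nn.Parameter(torch.zeros(1))  # start from the pure target path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is the target path plus a learnably scaled auxiliary path.
        return self.target(x) + self.scale * self.auxiliary(x)

# Usage: wrap an image-model layer with weights taken from an unrelated text model.
layer = CrossModalLinear(nn.Linear(768, 768), nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```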

Meet BiTA: An Innovative AI Method Expediting LLMs via Streamlined Semi-Autoregressive Generation and Draft Verification

Large language models (LLMs) based on transformer architectures have emerged in recent years. Models such as ChatGPT and LLaMA-2 demonstrate how rapidly LLM parameter counts have grown, ranging from several billion to over a trillion. Although LLMs are very capable generators, they struggle with inference latency, since there is a lot of computing…
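
To illustrate the general draft-and-verify idea behind such acceleration schemes, here is a toy Python sketch: a cheap drafter proposes several future tokens, and the full model accepts the longest matching prefix. The stand-in `model_next_token` and `draft_next_token` functions are hypothetical and do not reflect BiTA's actual prompting-based design.

```python
def model_next_token(context):
    # Stand-in for one expensive autoregressive step of the full model.
    return (sum(context) + len(context)) % 100

def draft_next_token(context):
    # Stand-in cheap drafter: usually agrees with the full model, sometimes errs.
    guess = (sum(context) + len(context)) % 100
    return guess if len(context) % 5 else (guess + 1) % 100

def generate(prompt, n_tokens, k=4):
    out, target = list(prompt), len(prompt) + n_tokens
    while len(out) < target:
        # Draft k tokens cheaply, then verify them left to right. In a real
        # system the check over all k drafts is one parallel forward pass,
        # which is where the speedup over token-by-token decoding comes from.
        ctx, draft = list(out), []
        for _ in range(k):
            draft.append(draft_next_token(ctx))
            ctx.append(draft[-1])
        for tok in draft:
            correct = model_next_token(out)
            out.append(correct)  # always keep the model-approved token
            if tok != correct or len(out) >= target:
                break  # a mismatch invalidates the rest of the draft
    return out[len(prompt):target]

print(generate([1, 2, 3], n_tokens=8))  # identical output to plain decoding
```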

A New Research Study from the University of Surrey Shows Artificial Intelligence Could Help Power Plants Capture Carbon Using 36% Less Energy from the Grid

Artificial intelligence is proving widely useful in environment-related fields, and there has been growing research on applying AI to carbon capture technology. Carbon capture is critical in tackling climate change because it traps carbon dioxide (CO2) emissions from power plants. However, current carbon capture systems are inefficient and can consume significant energy. Consequently, researchers from…

UC Berkeley and UCSF Researchers Propose Cross-Attention Masked Autoencoders (CrossMAE): A Leap in Efficient Visual Data Processing

One of the more intriguing developments in the dynamic field of computer vision is the efficient processing of visual data, which is essential for applications ranging from automated image analysis to the development of intelligent systems. A pressing challenge in this area is interpreting complex visual information, particularly in reconstructing detailed images from partial data…
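
To give a feel for the idea, here is a minimal PyTorch sketch in which queries for masked patches attend only to visible-patch encodings via cross-attention, rather than every token attending to every other token; the dimensions, random masking, and zero-initialized queries are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

num_patches, dim, visible = 196, 256, 49  # e.g. 75% of the patches are masked out
encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
to_pixels = nn.Linear(dim, 16 * 16 * 3)  # predict raw pixels for each masked patch

patches = torch.randn(1, num_patches, dim)       # embedded image patches
perm = torch.randperm(num_patches)
visible_idx, masked_idx = perm[:visible], perm[visible:]

encoded = encoder(patches[:, visible_idx])       # encode visible patches only
# In a real model these queries would be mask tokens plus positional embeddings.
queries = torch.zeros(1, num_patches - visible, dim)
decoded, _ = cross_attn(queries, encoded, encoded)  # queries read from visible tokens
print(to_pixels(decoded).shape)                  # torch.Size([1, 147, 768])
```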

Deciphering Neuronal Universality in GPT-2 Language Models

As Large Language Models (LLMs) gain prominence in high-stakes applications, understanding their decision-making processes becomes crucial to mitigate potential risks. The inherent opacity of these models has fueled interpretability research, which leverages the unique advantages of artificial neural networks, namely that they are observable and deterministic, for empirical scrutiny. A comprehensive understanding of these models not only enhances our knowledge but…
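
The "observable" property this line of work relies on is easy to demonstrate: a forward hook can record per-neuron MLP activations from a pretrained GPT-2, which is the raw material for comparing neurons across models. The layer choice and probe sentence below are arbitrary assumptions for illustration.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

captured = {}
def hook(module, inputs, output):
    captured["acts"] = output.detach()  # (batch, seq_len, mlp_hidden)

# Record the post-activation MLP outputs of the first transformer block.
handle = model.h[0].mlp.act.register_forward_hook(hook)
with torch.no_grad():
    model(**tokenizer("The quick brown fox", return_tensors="pt"))
handle.remove()

print(captured["acts"].shape)  # e.g. torch.Size([1, 4, 3072])
```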

Meet WebVoyager: An Innovative Large Multimodal Model (LMM) Powered Web Agent that can Complete User Instructions End-to-End by Interacting with Real-World Websites

Existing web agents are limited because they often rely on a single input modality and are evaluated only in controlled settings such as web simulators or static snapshots, which fail to capture the complexity and dynamic nature of real-world web interactions. This significantly restricts their applicability and effectiveness in real-world…

This AI Paper from China Sheds Light on the Vulnerabilities of Vision-Language Models: Unveiling RTVLM, the First Red Teaming Dataset for Multimodal AI Security

Vision-Language Models (VLMs) are Artificial Intelligence (AI) systems that can interpret and comprehend both visual and written inputs. Incorporating Large Language Models (LLMs) into VLMs has enhanced their comprehension of intricate inputs. Though VLMs have made encouraging progress and gained significant popularity, limitations remain in their effectiveness in difficult settings. The core of VLMs,…