| # | Title | Summary | Keywords | |
|---|-------|---------|----------|---|
| 1 | Risk-adaptive Activation Steering for Safe Multimodal Large Language Models | Proposes Risk-adaptive Activation Steering (RAS), improving the safety of multimodal large language models while accelerating inference. | large language model, multimodal | |
| 2 | Model-agnostic Adversarial Attack and Defense for Vision-Language-Action Models | Proposes a model-agnostic adversarial attack and defense method for vision-language-action models. | vision-language-action, VLA | ✅ |
| 3 | Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs | Proposes the Honey-Data-15M dataset and the Bee-8B model, advancing the performance of fully open-source multimodal large language models. | large language model, multimodal, chain-of-thought | |
| 4 | Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark | Proposes Uni-MMMU, a massive multi-discipline multimodal unified benchmark for evaluating the bidirectional synergy between visual understanding and generation models. | multimodal | |
| 5 | Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues | Proposes a condition-aware dynamic fusion method to improve the robustness of UAV-based multimodal object detection in complex scenes. | multimodal | |
| 6 | Language as a Label: Zero-Shot Multimodal Classification of Everyday Postures under Data Scarcity | Uses language as labels for zero-shot multimodal classification, addressing everyday posture recognition under data scarcity. | multimodal | |
| 7 | OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment | Proposes OS-HGAdapter, which leverages large language models to enhance image-text alignment and significantly improve cross-modal retrieval. | large language model | |
| 8 | Reasoning in Space via Grounding in the World | Proposes a world-grounded Grounded-Spatial Reasoner to strengthen 3D spatial reasoning. | visual grounding, chain-of-thought | |
| 9 | RECODE: Reasoning Through Code Generation for Visual Question Answering | Proposes the RECODE framework, enabling more precise, verifiable reasoning in visual question answering via code generation. | large language model, multimodal | |
| 10 | OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild | Proposes OmniGaze, a reward-inspired framework for generalizable gaze estimation that addresses in-the-wild generalization. | large language model, multimodal | |
| 11 | Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding | Proposes Vgent, improving long video understanding through graph-based retrieval-reasoning-augmented generation. | large language model | ✅ |
| 12 | Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation | Proposes the AVC framework, which adaptively conditions diffusion models on visual inputs to improve semantic consistency in story continuation. | large language model | |
| 13 | InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue | Proposes InteractiveOmni, a unified omni-modal large language model for audio-visual multi-turn dialogue. | large language model | |
| 14 | Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection | Studies adversarial robustness and uncertainty quantification in DINOv2-based few-shot anomaly detection. | foundation model | |
| 15 | Visual Interestingness Decoded: How GPT-4o Mirrors Human Interests | Explores GPT-4o's understanding of visual interestingness and applies it to improve learning-to-rank models. | multimodal | |
| 16 | Self-Augmented Visual Contrastive Decoding | Proposes Self-Augmented Visual Contrastive Decoding, improving the factual consistency of large vision-language models. | multimodal | |
| 17 | MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models | Proposes the MMLongCite benchmark for evaluating the information fidelity of long-context vision-language models. | multimodal | |
| 18 | What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging | Proposes the NegToMe module and the CoVAND dataset, improving VLM detection of objects described with negation. | chain-of-thought | |