cs.CV (2024-11-26)

📊 55 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (27 🔗7) · Pillar 3: Perception & Semantics (12 🔗5) · Pillar 2: RL & Architecture (11 🔗4) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (2) · Pillar 4: Generative Motion (1 🔗1)

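🔬 Pillar 9: Embodied Foundation Models (27 papers)

#	Title	One-line summary	Tags	🔗
1	Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Visatronic: a multimodal decoder-only model for speech synthesis that generates speech from video and text.	large language model, foundation model, multimodal	✅
2	InsightEdit: Towards Better Instruction Following for Image Editing	InsightEdit: uses multimodal large language models to improve instruction-driven image editing.	large language model, multimodal, instruction following
3	Multimodal Alignment and Fusion: A Survey	Surveys multimodal alignment and fusion techniques, covering structural perspectives and methodological paradigms, with the aim of improving the generalization of multimodal learning systems.	embodied AI, large language model, multimodal
4	NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?	NEMO: evaluates the ability of multimodal large language models to identify attribute-modified objects.	large language model, multimodal
5	ShowUI: One Vision-Language-Action Model for GUI Visual Agent	Proposes ShowUI, a vision-language-action model for GUI visual agents.	vision-language-action, instruction following	✅
6	Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment	Proposes Grounding-IQA, which uses multimodal grounding to make image quality assessment more fine-grained.	large language model, multimodal	✅
7	Real-Time Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee	Proposes a real-time multimodal signal processing method for human-robot interaction in RoboCup to understand a human referee.	multimodal
8	Video-Guided Foley Sound Generation with Multimodal Controls	MultiFoley: a video-guided Foley sound generation model with multimodal controls.	multimodal	✅
9	HyperSeg: Towards Universal Visual Segmentation with Large Language Model	HyperSeg: a universal visual segmentation model based on a large language model, providing pixel-level understanding of images and videos.	large language model
10	Multimodal Outer Arithmetic Block Dual Fusion of Whole Slide Images and Omics Data for Precision Oncology	Proposes a dual-fusion approach built on a Multimodal Outer Arithmetic Block to improve tumor-subtype diagnosis accuracy when fusing WSI and omics data.	multimodal
11	Efficient Multi-modal Large Language Models via Visual Token Grouping	Proposes VisToG, which improves the efficiency of multimodal large language models through visual token grouping.	large language model
12	Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models	Uses vision foundation models to explore aleatoric uncertainty in object detection and improve model robustness.	foundation model
13	Advancing Content Moderation: Evaluating Large Language Models for Detecting Sensitive Content Across Text, Images, and Videos	Evaluates the ability of large language models to detect sensitive content in text, images, and videos, with the aim of improving content moderation.	large language model
14	SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery	SatVision-TOA: a geospatial foundation model for coarse-resolution all-sky remote sensing imagery.	foundation model
15	Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration	Proposes the FiCoCo framework, which accelerates multimodal large language models through training-free token reduction.	large language model, multimodal	✅
16	SketchAgent: Language-Driven Sequential Sketch Generation	SketchAgent: a language-driven sequential sketch generation method.	large language model, multimodal
17	HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator	Proposes HEIE, an MLLM-based hierarchical explainable evaluator of AIGC image implausibility.	large language model, multimodal	✅
18	DOGR: Towards Versatile Visual Document Grounding and Referring	DOGR: a model, data engine, and benchmark for versatile visual document grounding and referring.	large language model, multimodal
19	OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection	OpenAD: an open-world autonomous driving benchmark for 3D object detection.	large language model, multimodal	✅
20	The Context of Crash Occurrence: A Complexity-Infused Approach Integrating Semantic, Contextual, and Kinematic Features	Proposes a roadway complexity analysis framework that integrates semantic, contextual, and kinematic features to improve crash prediction accuracy.	large language model
21	Bi-ICE: An Inner Interpretable Framework for Image Classification via Bi-directional Interactions between Concept and Input Embeddings	Proposes Bi-ICE, which improves the inner interpretability of image classification through bi-directional interactions between concept and input embeddings.	large language model
22	Scene Co-pilot: Procedural Text to Video Generation with Human in the Loop	Proposes the Scene Co-pilot framework, which combines an LLM with procedural 3D scene generation for controllable text-to-video generation.	large language model
23	FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval	Proposes FLEX-CLIP, which enhances CLIP with a feature-level generation network to address feature degradation and data imbalance in X-shot cross-modal retrieval.	multimodal
24	VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models	Proposes VL-RewardBench for evaluating and improving vision-language generative reward models.	multimodal
25	in-Car Biometrics (iCarB) Datasets for Driver Recognition: Face, Fingerprint, and Voice	Releases the iCarB in-car biometrics datasets for driver recognition, covering three modalities: face, fingerprint, and voice.	multimodal
26	Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation	Reflect3D: uses single-image symmetry detection to achieve high-quality 3D generation.	foundation model
27	MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding	MUSE-VL: models a unified vision-language model through semantic discrete encoding.	multimodal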

🔬 Pillar 3: Perception & Semantics (12 papers)

#	Title	One-line summary	Tags	🔗
28	DROID-Splat: Combining end-to-end SLAM with 3D Gaussian Splatting	DROID-Splat: combines end-to-end SLAM with 3D Gaussian Splatting for state-of-the-art tracking and rendering.	monocular depth, 3D gaussian splatting, gaussian splatting	✅
29	Distractor-free Generalizable 3D Gaussian Splatting	Proposes DGGS for distractor-free reconstruction in cross-scene generalizable 3D Gaussian Splatting.	3D gaussian splatting, 3DGS, gaussian splatting	✅
30	4D Scaffold Gaussian Splatting with Dynamic-Aware Anchor Growing for Efficient and High-Fidelity Dynamic Scene Reconstruction	Proposes 4D Scaffold Gaussian Splatting with dynamic-aware anchor growing for efficient, high-fidelity dynamic scene reconstruction.	gaussian splatting, splatting, scene reconstruction
31	SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting	SelfSplat: a generalizable 3D Gaussian Splatting method that requires neither poses nor 3D priors.	3D gaussian splatting, gaussian splatting, splatting	✅
32	Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation	Proposes an object-context-aware open-vocabulary semantic segmentation method based on spectral graph distillation.	open-vocabulary, open vocabulary, foundation model
33	HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving	HSI-Drive v2.0: expands the hyperspectral imaging dataset to improve scene understanding for autonomous driving.	scene understanding, HSI
34	MLI-NeRF: Multi-Light Intrinsic-Aware Neural Radiance Fields	Proposes MLI-NeRF, which uses multi-light information to address intrinsic image decomposition in NeRF.	NeRF, neural radiance field
35	DepthCues: Evaluating Monocular Depth Perception in Large Vision Models	DepthCues: evaluates monocular depth perception in large vision models.	depth estimation, monocular depth
36	Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors	Proposes a self-supervised monocular depth and pose estimation method for endoscopy based on generative latent priors.	monocular depth
37	Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions	Proposes Puzzle Similarity for no-reference artifact detection in 3D reconstructions, improving reconstruction quality.	scene reconstruction	✅
38	Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors	Buffer Anytime: zero-shot video depth and normal estimation using image priors.	Depth Anything, optical flow
39	Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning	Proposes the BoMBo strategy, which uses weak losses for multi-task partially supervised learning to improve object detection and semantic segmentation.	scene understanding	✅

🔬 Pillar 2: RL & Architecture (11 papers)

#	Title	One-line summary	Tags	🔗
40	VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension	VersatileMotion: a unified multimodal motion LLM framework for motion synthesis and comprehension.	flow matching, motion synthesis, motion tokenizer
41	FTMoMamba: Motion Generation with Frequency and Text State Space Models	FTMoMamba: motion generation with frequency and text state space models.	Mamba, state space model, text-to-motion
42	BadScan: An Architectural Backdoor Attack on Visual State Space Models	BadScan: an architectural backdoor attack on visual state space models.	Mamba, SSM, state space model
43	Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation	Proposes a robust anymodal segmentor based on modality distillation, addressing unimodal bias in multimodal segmentation.	distillation, multimodal
44	D$^2$-World: An Efficient World Model through Decoupled Dynamic Flow	D$^2$-World: efficiently predicts future point clouds through decoupled dynamic flow.	world model, foundation model	✅
45	SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation	SVGDreamer++: proposes HIVE and VPSD to improve editability and diversity in text-guided SVG generation.	dreamer, distillation	✅
46	Spatially Visual Perception for End-to-End Robotic Learning	Proposes a spatial-perception-based end-to-end robotic learning framework that improves generalization under lighting changes.	imitation learning, depth estimation, monocular depth
47	TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba	TinyViM: a tiny hybrid vision Mamba model built on frequency decoupling, improving performance and speeding up inference.	Mamba	✅
48	DWCL: Dual-Weighted Contrastive Learning for Multi-View Clustering	Proposes Dual-Weighted Contrastive Learning (DWCL) to address representation degeneration and unreliable views in multi-view clustering.	contrastive learning
49	Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation	Proposes MUSE: large-scale data-free knowledge distillation for ImageNet via multi-resolution data generation.	distillation	✅
50	Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models	Proposes Semantic Anchor Transport (SAT) to address robustness in test-time adaptation of vision-language models.	representation learning, contrastive learning, distillation

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗
51	vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation	Proposes vesselFM, a medical imaging foundation model for universal 3D blood vessel segmentation.	domain randomization, flow matching, foundation model
52	GMFlow: Global Motion-Guided Recurrent Flow for 6D Object Pose Estimation	Proposes GMFlow: global motion-guided recurrent flow for 6D object pose estimation.	manipulation, linear attention

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗
53	AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM	Proposes AIGV-Assessor, which uses LMMs to evaluate the perceptual quality of text-to-video generation, and builds the large-scale AIGVQA-DB dataset.	spatiotemporal, multimodal
54	Selfish Evolution: Making Discoveries in Extreme Label Noise with the Help of Overfitting Dynamics	Proposes Selfish Evolution, which uses overfitting dynamics to make discoveries and corrections under extreme label noise.	spatiotemporal

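🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗
55	I2VControl: Disentangled and Unified Video Motion Synthesis Control	I2VControl: a disentangled and unified video motion synthesis control framework that fuses multiple control types without conflict.	motion synthesis	✅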
