cs.CV (2024-12-13)

📊 39 papers in total | 🔗 8 with code

🎯 Research Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗5) | Pillar 2: RL & Architecture (8) | Pillar 3: Perception & Semantics (8 🔗1) | Pillar 4: Generative Motion (4 🔗2) | Pillar 8: Physics-based Animation (1) | Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

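#	Title	One-line Summary	Tags	🔗	⭐
1	Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation	Proposes Simignore to improve the complex reasoning ability of multimodal large language models	large language model multimodal chain-of-thought	✅
2	EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing	EVLM: cross-dimensional visual editing via self-reflective multimodal reasoning	multimodal chain-of-thought
3	DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding	multimodal visual grounding	✅
4	DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts	DEFAME: a fact-checking framework built on dynamic evidence and multimodal experts that substantially improves verification performance on mixed text-image claims.	multimodal
5	Apollo: An Exploration of Video Understanding in Large Multimodal Models	Apollo: explores video understanding in large multimodal models and proposes efficient training strategies.	multimodal
6	Robust image classification with multi-modal large language models	Proposes MultiShield, which uses multimodal large language models to make image classifiers more robust to adversarial attacks.	large language model
7	CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information	CognitionCapturer: decodes visual stimuli from human EEG signals using multimodal information	multimodal	✅
8	Learning Complex Non-Rigid Image Edits from Multimodal Conditioning	Proposes an image editing method driven by multimodal conditioning that enables person insertion and pose editing.	multimodal
9	A multimodal dataset for understanding the impact of mobile phones on remote online virtual education	IMPROVE: a multimodal dataset for understanding how mobile phone use affects remote online education	multimodal
10	B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens	Proposes B-VLLM to tackle the large number of visual tokens in long-video understanding	large language model	✅
11	All-in-One: Transferring Vision Foundation Models into Stereo Matching	AIO-Stereo: transfers vision foundation models to stereo matching, achieving a performance breakthrough	foundation model
12	Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining	Iris: a visual agent that handles GUI complexity through adaptive focus and self-refining	large language model multimodal
13	BrushEdit: All-In-One Image Inpainting and Editing	Proposes BrushEdit, an inpainting-based interactive framework for instruction-guided image editing	large language model multimodal
14	Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics	Proposes UniSim-Bench, a benchmark for multimodal perceptual metrics, and explores unified multimodal perceptual models.	multimodal	✅
15	Single-Pass Object-Focused Data Selection	Proposes an object-focused data selection method to make better use of a limited annotation budget	foundation model
16	Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction	Proposes the CoVLA framework to resolve ambiguity and cross-modal discrepancies in semantic location prediction from multimodal social media posts	multimodal
17	CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection	CP-DETR: uses concept prompts to guide DETR toward stronger universal object detection	foundation model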

🔬 Pillar 2: RL & Architecture (8 papers)

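#	Title	One-line Summary	Tags	🔗	⭐
18	SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians	SuperGSeg: open-vocabulary 3D segmentation with structured Super-Gaussians	distillation 3D gaussian splatting gaussian splatting
19	MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow	MulSMo: multimodal stylized motion generation via bidirectional control flow.	contrastive learning motion generation multimodal
20	Predictive Modeling, Pattern Recognition, and Spatiotemporal Representations of Plant Growth in Simulated and Controlled Environments: A Comprehensive Review	Reviews spatiotemporal modeling of plant growth, focusing on prediction, pattern recognition, and interaction with the environment.	predictive model spatiotemporal
21	XYScanNet: A State Space Model for Single Image Deblurring	Proposes XYScanNet, which combines a state space model with a slice-scanning strategy for single-image deblurring.	Mamba SSM state space model
22	A dual contrastive framework	AlignCap: proposes a dual contrastive framework that strengthens region-level visual understanding and improves region captioning performance.	contrastive learning multimodal
23	Towards Consistent and Efficient Dataset Distillation via Diffusion-Driven Selection	Proposes a dataset distillation method based on diffusion-driven selection, improving consistency and efficiency.	distillation
24	Going Beyond Feature Similarity: Effective Dataset Distillation based on Class-Aware Conditional Mutual Information	Proposes a dataset distillation method based on class-aware conditional mutual information, improving training efficiency and performance.	distillation
25	Selective State Space Memory for Large Vision-Language Models	Proposes SSMI, which fine-tunes large vision-language models efficiently via selective state space memory	Mamba multimodal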

🔬 Pillar 3: Perception & Semantics (8 papers)

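#	Title	One-line Summary	Tags	🔗	⭐
26	NeRF-Texture: Synthesizing Neural Radiance Field Textures	Proposes NeRF-Texture, which uses neural radiance fields to synthesize textures with 3D mesoscale structure.	NeRF neural radiance field implicit representation
27	Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation	Proposes PMP, a prompt-guided mask proposal network for two-stage open-vocabulary segmentation.	open-vocabulary open vocabulary
28	TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views	TSGaussian: target-specific Gaussian splatting reconstruction from sparse views, guided by semantic and depth priors	gaussian splatting splatting	✅
29	SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video	Proposes SplineGS, which uses motion-adaptive splines for real-time dynamic 3D Gaussian reconstruction from monocular video.	3D gaussian splatting 3DGS gaussian splatting
30	GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion	Proposes GAF, which uses a multi-view diffusion model to reconstruct high-fidelity, animatable 3D Gaussian head avatars from monocular video	gaussian splatting splatting
31	Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning	Proposes a spiking-neuron-aided fast geometry learning method for NeRF that makes geometry extraction more efficient.	NeRF neural radiance field
32	Coherent 3D Scene Diffusion From a Single RGB Image	Proposes a diffusion-based method for coherent 3D scene reconstruction from a single RGB image	scene reconstruction
33	A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization	Proposes a differentiable wave optics model for end-to-end optimization of computational imaging systems	scene reconstruction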

🔬 Pillar 4: Generative Motion (4 papers)

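#	Title	One-line Summary	Tags	🔗	⭐
34	EP-CFG: Energy-Preserving Classifier-Free Guidance	Proposes EP-CFG, which preserves energy to counter the oversaturation and excessive contrast that CFG causes in diffusion models	classifier-free guidance
35	The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion	Proposes a 3D human motion generation framework that unifies verbal and non-verbal language	motion generation multimodal
36	VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization	VQTalker: multilingual lip-synced talking-avatar generation via facial motion tokenization	motion generation motion tokenizer	✅
37	Quaffure: Real-Time Quasi-Static Neural Hair Simulation	Quaffure: real-time quasi-static neural hair simulation that makes virtual avatars more realistic	physically plausible	✅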

🔬 Pillar 8: Physics-based Animation (1 paper)

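#	Title	One-line Summary	Tags	🔗	⭐
38	Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP	Proposes the STDD framework, which uses a spatiotemporal dynamic expert knowledge graph to improve CLIP's zero-shot action recognition performance	spatiotemporal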

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line Summary	Tags	🔗	⭐
39	SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers	SVGBuilder: component-based colored SVG generation with text-guided autoregressive Transformers	manipulation
