Multimedia
Showing new listings for Thursday, 19 March 2026
- [1] arXiv:2603.16890 [pdf, html, other]
Title: Amanous: Distribution-Switching for Superhuman Piano Density on Disklavier
Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
The automated piano enables note densities, polyphony, and register changes far beyond human physical limits, yet the three dominant traditions for composing such textures--Nancarrow's tempo canons, Xenakis's stochastic distributions, and L-system grammars--have developed in isolation. This paper presents Amanous, a hardware-aware composition system for Yamaha Disklavier that unifies these methodologies through distribution-switching: L-system symbols select distinct distributional regimes rather than merely modulating parameters within a fixed family. Four contributions are reported. (1) A four-layer architecture (symbolic, parametric, numeric, physical) produces statistically distinct sections with large effect sizes (d = 3.70-5.34), validated by per-layer degradation and ablation experiments. (2) A hardware abstraction layer formalizes velocity-dependent latency and key reset constraints, keeping superhuman textures within the Disklavier's actuable envelope. (3) A density sweep reveals a computational saturation transition at 24-30 notes/s (bootstrap 95% CI: 23.3-50.0), beyond which single-domain melodic metrics lose discriminative power and cross-domain coupling becomes necessary. (4) A convergence point calculus operationalizes tempo-canon geometry as a control interface, enabling convergence events to trigger distribution switches linking macro-temporal structure to micro-level texture. All results are computational; a psychoacoustic validation protocol is proposed for future work. The pipeline has been deployed on a physical Disklavier, demonstrating algorithmic self-consistency and sub-millisecond software precision. Supplementary materials (Excerpts 1-4): this https URL. Source code: this https URL.
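As a rough illustration of the distribution-switching idea described in this abstract, the sketch below expands a small L-system and lets each symbol choose a different distribution family for inter-onset intervals, rather than tuning one family's parameters. The rewrite rules, the two distribution families, and all function names (`expand`, `sample_ioi`, `generate_onsets`) are hypothetical and are not taken from Amanous or its source code.

```python
import random

# Hypothetical rewrite rules (Fibonacci-style L-system); not from the paper.
RULES = {"A": "AB", "B": "A"}


def expand(axiom, rules, generations):
    """Apply the rewrite rules to every symbol, `generations` times."""
    s = axiom
    for _ in range(generations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s


def sample_ioi(symbol, rng):
    """Draw one inter-onset interval (seconds) from the regime named by `symbol`.

    Assumption: 'A' selects an exponential family and 'B' a uniform family,
    so switching symbols switches the distribution itself.
    """
    if symbol == "A":
        return rng.expovariate(20.0)   # exponential, mean 0.05 s
    if symbol == "B":
        return rng.uniform(0.02, 0.04)  # uniform, 25 to 50 notes/s
    raise ValueError(f"unknown symbol: {symbol!r}")


def generate_onsets(symbols, notes_per_section, seed=0):
    """Return a list of (onset_time, symbol) pairs, one section per symbol."""
    rng = random.Random(seed)
    t = 0.0
    events = []
    for sym in symbols:
        for _ in range(notes_per_section):
            t += sample_ioi(sym, rng)
            events.append((t, sym))
    return events


if __name__ == "__main__":
    sections = expand("A", RULES, 3)  # "ABAAB"
    for onset, sym in generate_onsets(sections, notes_per_section=4)[:8]:
        print(f"{onset:.4f}s  regime {sym}")
```

A real system would also have to respect the hardware constraints the abstract mentions (velocity-dependent latency, key reset times); this sketch ignores them.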
- [2] arXiv:2603.17347 [pdf, html, other]
Title: Beyond Forced Modality Balance: Intrinsic Information Budgets for Multimodal Learning
Comments: 6 pages, 4 figures, paper accepted by ICME 2026
Subjects: Multimedia (cs.MM)
Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or losses. However, they overlook the fact that each modality has finite information capacity. In this work, we propose IIBalance, a multimodal learning framework that aligns the modality contributions with Intrinsic Information Budgets (IIB). We propose a task-grounded estimator of each modality's IIB, transforming its capacity into a global prior over modality contributions. Anchored by the highest-budget modality, we design a prototype-based relative alignment mechanism that corrects semantic drift only when weaker modalities deviate from their budgeted potential, rather than forcing imitation. During inference, we propose a probabilistic gating module that integrates the global budgets with sample-level uncertainty to generate calibrated fusion weights. Experiments on three representative benchmarks demonstrate that IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues. Our code is available at: this https URL.
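The inference-time gating described in this abstract combines a global per-modality budget with sample-level uncertainty. One simple way such a combination could look is sketched below: each modality's budget is scaled by a confidence term derived from the entropy of its prediction, and the results are normalized into fusion weights. The scoring rule `budget * exp(-entropy)` and the function names are assumptions made for illustration; the abstract does not specify IIBalance's actual gating formula.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def fusion_weights(budgets, per_modality_probs):
    """Combine global budgets with per-sample uncertainty into weights.

    Assumed rule (not from the paper): score_m = budget_m * exp(-H(p_m)),
    then normalize so the weights sum to 1.
    """
    scores = [
        b * math.exp(-entropy(p))
        for b, p in zip(budgets, per_modality_probs)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def fuse(budgets, per_modality_probs):
    """Weighted average of the per-modality class probabilities."""
    weights = fusion_weights(budgets, per_modality_probs)
    num_classes = len(per_modality_probs[0])
    return [
        sum(w * p[k] for w, p in zip(weights, per_modality_probs))
        for k in range(num_classes)
    ]


if __name__ == "__main__":
    budgets = [0.6, 0.4]            # hypothetical global prior
    probs = [[0.5, 0.5],            # modality 0: uncertain on this sample
             [0.9, 0.1]]            # modality 1: confident on this sample
    print(fusion_weights(budgets, probs))
    print(fuse(budgets, probs))
```

Under this rule, a low-budget modality can still receive a substantial weight on samples where it is confident, which is the qualitative behavior the abstract attributes to combining global budgets with sample-level uncertainty.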
New submissions (showing 2 of 2 entries)
- [3] arXiv:2603.16966 (cross-list from cs.CV) [pdf, html, other]
Title: CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
Comments: Accepted to CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
Cross submissions (showing 1 of 1 entries)
- [4] arXiv:2508.20476 (replaced) [pdf, html, other]
Title: Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
- [5] arXiv:2602.07768 (replaced) [pdf, html, other]
Title: PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Comments: 6 pages, 3 figures, conference
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at this https URL.