Internship - Contrastive Multimodal Pretraining for Noise-aware Diffusion-based Audio-visual Speech Enhancement


Villers-lès-Nancy, Grand Est, France - Inria - Full-time

Contract type: Internship agreement

Required degree level: Bac+4 or equivalent (four years of higher education)

Position: Research intern

Context and advantages of the position

This master's internship is part of the REAVISE project, "Robust and Efficient Deep Learning based Audiovisual Speech Enhancement", funded by the French National Research Agency (ANR). REAVISE aims to develop a unified, robust, and efficient audio-visual speech enhancement (AVSE) framework that leverages recent methodological breakthroughs in statistical signal processing, machine learning, and deep neural networks.

The intern will be supervised by Mostafa Sadeghi (researcher, Inria) and Romain Serizel (associate professor, University of Lorraine), both members of the MULTISPEECH team, and by Xavier Alameda-Pineda (Inria Grenoble), member of the RobotLearn team. The intern will benefit from the team's research environment, expertise, and powerful computational resources (GPUs and CPUs).

Assigned mission

Diffusion models are a state-of-the-art class of generative models that are highly effective at modeling natural data such as images and audio [1]. They rely on a forward (noising) process that incrementally transforms training data into Gaussian noise, paired with a reverse (denoising) process that reconstructs a data point from noise. Recently, diffusion models have shown promising performance for unsupervised speech enhancement [2]. Used as data-driven priors for clean speech, they make it possible to enhance noisy speech by estimating the clean speech through posterior sampling, effectively separating it from background noise. Integrating video as conditioning information into the speech model further improves enhancement by exploiting visual cues from the target speaker [3]. This underscores the potential of combining audio and visual data to improve speech quality, especially in highly noisy environments.
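To make the forward (noising) process concrete, here is a minimal sketch using the variance-exploding schedule described in [1], where a clean sample is perturbed with Gaussian noise whose scale grows with the time step. The function names (`sigma`, `forward_noise`), the schedule bounds, and the toy sine signal are illustrative assumptions, and the schedule is not necessarily the one used in [2] or [3].

```python
import math
import random

def sigma(t, sigma_min=0.01, sigma_max=10.0):
    """Noise scale of a variance-exploding schedule, for t in [0, 1].

    sigma_min and sigma_max are assumed values chosen for illustration.
    """
    return sigma_min * (sigma_max / sigma_min) ** t

def forward_noise(x0, t, rng):
    """Sample x_t ~ N(x0, sigma(t)^2 I): the clean signal plus scaled Gaussian noise."""
    s = sigma(t)
    return [x + s * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
# Toy "clean" signal: one period of a sine wave (a stand-in for speech features).
x0 = [math.sin(2 * math.pi * n / 64) for n in range(64)]

for t in (0.0, 0.5, 1.0):
    xt = forward_noise(x0, t, rng)
    # The mean squared deviation from x0 should be on the order of sigma(t)^2.
    mse = sum((a - b) ** 2 for a, b in zip(xt, x0)) / len(x0)
    print(f"t={t:.1f}  sigma={sigma(t):.3f}  mse={mse:.4f}")
```

As t approaches 1 the sample is dominated by noise; the reverse process is trained to undo this corruption step by step.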

Contrastive learning further extends multimodal integration, as shown by models such as CLIP (Contrastive Language–Image Pre-training) [4] and CLAP (Contrastive Language–Audio Pre-training) [5], which bridge text with images and audio, respectively. These models learn a shared multimodal embedding space that supports applications ranging from text-to-image generation to sophisticated audio processing tasks. Although they have been used in audio tasks such as source separation [6], generation [7], classification [8], and localization [9], their application to audio-visual speech enhancement remains largely unexplored.
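As an illustration of how such a shared embedding space can be learned, the sketch below computes a symmetric contrastive (InfoNCE) loss of the kind used by CLIP-style models: matching pairs in a batch are pulled together and all other pairings are pushed apart. It is written in plain Python for readability (in practice a framework such as PyTorch would be used); the function names, the temperature of 0.07, and the toy 2-D embeddings are assumptions made for this example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    emb_a[i] and emb_b[i] are assumed to describe the same item in two
    modalities; every other pairing in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_b2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)

# Hypothetical 3-item batch with 2-D embeddings from two modalities.
text_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

print("aligned pairs: ", contrastive_loss(text_emb, aligned))
print("shuffled pairs:", contrastive_loss(text_emb, shuffled))
```

The loss is low when each embedding is closest to its own partner and high when the pairing is scrambled, which is what drives the two modalities toward a common space.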

Main activities

The primary objective of this project is to extend audio-visual speech enhancement by incorporating additional modality information into the noise model. By using either textual descriptions or visual representations of the noise environment, such as videos or images depicting the acoustic scene, we aim to improve the model's ability to identify and differentiate noise sources. This will involve developing robust contrastive learning techniques that handle discrepancies between training and testing conditions, for example training with textual noise descriptions and testing with visual data, which the shared multimodal embedding space makes possible.

To address these challenges, we propose to:

  • Develop a contrastive learning framework that can dynamically adapt to different modalities of noise information, ensuring that the system remains effective regardless of the variability in available data type at training and test times.
  • Utilize the shared embedding space learned through contrastive methods as conditioning information for the noise model to improve the performance of speech enhancement systems, making them more adaptable and effective in diverse and noisy environments.

References

[1] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," in International Conference on Learning Representations.

[2] B. Nortier, M. Sadeghi, and R. Serizel, "Unsupervised speech enhancement with diffusion-based generative models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[3] J.-E. Ayilo, M. Sadeghi, R. Serizel, and X. Alameda-Pineda, "Diffusion-based unsupervised audio-visual speech enhancement," HAL preprint, 2024.

[4] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, PMLR, 2021.

[5] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, "CLAP: Learning audio concepts from natural language supervision," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5, IEEE, 2023.

[6] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu, and W. Wang, "Separate anything you describe," arXiv preprint.

[7] Y. Yuan, H. Liu, X. Liu, X. Kang, P. Wu, M. D. Plumbley, and W. Wang, "Text-driven foley sound generation with latent diffusion model," arXiv preprint.

[8] A. Guzhov, F. Raue, J. Hees, and A. Dengel, "AudioCLIP: Extending CLIP to image, text and audio," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

[9] T. Mahmud and D. Marculescu, "AVE-CLIP: AudioCLIP-based multi-window temporal transformer for audio visual event localization," in IEEE/CVF Winter Conference on Applications of Computer Vision, 2023.

Skills

Preferred qualifications include a strong foundation in statistical (speech) signal processing and computer vision, as well as expertise in machine learning and proficiency with deep learning frameworks, particularly PyTorch.

Benefits
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage
Remuneration

€4.35/hour

General information
  • Theme/Domain: Language, speech and audio; Scientific computing (BAP E)
  • City: Villers-lès-Nancy
  • Inria centre: Centre Inria de l'Université de Lorraine
  • Desired start date:
  • Contract duration: 6 months
  • Application deadline:

Attention: Applications must be submitted online on the Inria website. Processing of applications sent through other channels is not guaranteed.

Application instructions

Defence security:

This position is likely to be assigned to a restricted-access zone (zone à régime restrictif, ZRR), as defined in decree n° relating to the protection of the nation's scientific and technical potential (PPST). Authorization to access such a zone is granted by the head of the institution, following a favourable ministerial opinion, as defined in the order of 3 July 2012 relating to the PPST. An unfavourable ministerial opinion for a position assigned to a ZRR would result in the cancellation of the recruitment.

Recruitment policy:

As part of its diversity policy, all Inria positions are accessible to people with disabilities.

Contacts
  • Inria team: MULTISPEECH
  • Recruiter: Sadeghi Mostafa /

Keys to success

Prospective applicants are invited to submit their academic transcripts, a detailed curriculum vitae (CV), and, if they choose, a cover letter. The cover letter should highlight the reasons for their enthusiasm and interest in this specific project.

About Inria

Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 215 agile project teams, generally run jointly with academic partners, involve more than 3,900 scientists in tackling the challenges of digital technology, often at the interface with other disciplines. The institute draws on a wide range of talent across more than forty different professions. 900 research and innovation support staff help scientific and entrepreneurial projects with worldwide impact to emerge and grow. Inria works with many companies and has supported the creation of more than 200 start-ups. The institute thus strives to meet the challenges of the digital transformation of science, society, and the economy.


