Training Engineer

6 days ago


Remote, France · Wazza Studio LLC · Full-time

Wazza Studio LLC

Location: Remote (global) or Paris, France
Type: Full-time
Level: Mid to Senior
Compensation: Competitive
Reports to: ML Research Lead
Hiring: 3 positions

About Wazza

Wazza is building the IP-licensed AI model network for film production. We train production-grade video generation models on properly licensed content and distribute royalties to IP owners via blockchain.

Think: "Spotify for AI model licensing, but for filmmaking."

We're training our first model now (producer's IP, Netflix-compliant) and scaling to 10+ models over the next 18 months.

The Role

We need 3 training engineers to build and scale the infrastructure that trains multi-billion parameter video generation models. You'll work on distributed training, data pipelines, experiment tracking, and optimization, making sure our DiT models train efficiently on clusters of A100 GPUs.

This is hands-on infrastructure work. You'll spend your time writing Python/YAML/bash, debugging NCCL errors, optimizing dataloaders, and making training runs 2x faster. If you've ever thought "I could make this model train better," this role is for you.

This role is perfect for someone who:
• Loves systems programming and making things fast
• Has war stories about debugging distributed training failures
• Gets excited about squeezing 10% more throughput from a GPU cluster
• Wants to work on cutting-edge models (DiT, flow matching) in production

What You'll Do

Core Responsibilities

Training Infrastructure (40%)
• Build distributed training pipelines for 3B-14B parameter DiT models
• Set up multi-node training on AWS A100 GPUs
• Optimize training throughput: mixed precision, gradient accumulation, activation checkpointing
• Debug NCCL/NVSHMEM issues, handle node failures gracefully
• Implement checkpointing, resumption, and fault tolerance

Data Pipeline (30%)
• Build scalable video preprocessing pipelines (depth estimation, camera pose, segmentation)
• Optimize data loading to prevent GPU starvation (PyTorch DataLoader, FFCV, WebDataset)
• Handle hours of 1080p video per model
• Implement data augmentation strategies for video
• Set up versioned datasets with provenance tracking

Monitoring & Optimization (20%)
• Instrument training with Weights & Biases / MLflow
• Build real-time dashboards for loss curves, GPU utilization, throughput
• Profile training runs, identify bottlenecks
• Implement gradient clipping, learning rate scheduling, warmup strategies
• Cost optimization: spot instances, preemption handling, cost per model tracking

Experimentation (10%)
• Support ML Research Lead with rapid experimentation
• Run ablation studies and architecture comparisons
• A/B test different training configurations
• Document best practices and reproduce successful runs

Requirements

Must Have
• 3+ years training deep learning models in production
• Strong Python and PyTorch (or JAX/TensorFlow with willingness to learn PyTorch)
• Distributed training experience: DDP, FSDP, DeepSpeed, or similar
• GPU programming knowledge: CUDA awareness, memory management, kernel profiling
• Linux systems proficiency: bash scripting, process management, network debugging
• Cloud experience: AWS/GCP/Azure, provisioning GPU instances, cost management
• Comfort with ambiguity: early-stage startup, things change rapidly

Ideally Have
• Trained billion-parameter models (1B+ params)
• Experience with video/multimodal models (not just image/text)
• Kubernetes for ML workloads (Kubeflow, Kubeflow Pipelines, or custom)
• Infrastructure as Code: Terraform, Pulumi, CloudFormation
• High-performance dataloaders: FFCV, WebDataset, DALI
• Built experiment tracking systems from scratch or heavily customized existing ones

Nice to Have
• Experience with model parallelism (tensor parallel, pipeline parallel)
• Knowledge of quantization and mixed-precision training
• Familiarity with Ray, Dask, or other distributed compute frameworks
• Understanding of compiler optimizations (TorchScript, TorchDynamo, XLA)
• Open-source contributions to PyTorch, Hugging Face, or ML infra projects

Day-to-Day Examples

Monday: ML Research Lead wants to try a new attention mechanism. You modify the training script, kick off a run on 8 A100s, instrument it with W&B, and have results by EOD.

Tuesday: Training run OOMed at step 12,000. You debug, realize activation checkpointing wasn't applied to a new layer, fix it, resume from last checkpoint.

Wednesday: Data loading is the bottleneck (GPUs at 60% utilization). You profile the dataloader, discover the depth estimation preprocessing is slow, move it to GPU, get utilization to 95%.

Thursday: Spot instance got preempted mid-training. Training resumed automatically from last checkpoint thanks to your fault-tolerance system. You review logs, confirm no data corruption.

Friday: Finance asks for training cost breakdown. You pull metrics from your tracking system: $2.1M spent, 180K GPU hours, $11.67/GPU-hour average. Under budget.

Tech Stack

Languages: Python (primary), Bash, YAML, SQL
Frameworks: PyTorch, Hugging Face, DeepSpeed, FSDP
Cloud: AWS (p4d.24xlarge, p5.48xlarge), S3, EBS
Orchestration: Kubernetes, Slurm (optional)
Monitoring: Weights & Biases, Prometheus, Grafana
Data: FFmpeg, OpenCV, PyAV, WebDataset
Infra: Docker, Terraform, GitHub Actions

We're flexible: If you have strong opinions on better tools, we'll listen.

Projects You'll Work On (First 6 Months)

Project 1: First Model Training Pipeline
• Set up 8-node A100 cluster on AWS
• Build data preprocessing for producer's 50 hours of footage
• Train 3B DiT model from scratch
• Success: Model trained in


