Training Engineer

28 minutes ago

Remote, France | Wazza Studio LLC | Full-time

Wazza Studio LLC

Location: Remote (global) or Paris, France
Type: Full-time
Level: Mid to Senior
Compensation: Competitive
Reports to: ML Research Lead
Hiring: 3 positions

About Wazza

Wazza is building the IP-licensed AI model network for film production. We train production-grade video generation models on properly licensed content and distribute royalties to IP owners via blockchain.

Think: "Spotify for AI model licensing, but for filmmaking."

We're training our first model now (producer's IP, Netflix-compliant) and scaling to 10+ models over the next 18 months.

The Role

We need 3 training engineers to build and scale the infrastructure that trains multi-billion parameter video generation models. You'll work on distributed training, data pipelines, experiment tracking, and optimization, making sure our DiT models train efficiently on clusters of A100 GPUs.

This is hands-on infrastructure work. You'll spend your time writing Python/YAML/bash, debugging NCCL errors, optimizing dataloaders, and making training runs 2x faster.
If you've ever thought "I could make this model train better," this role is for you.

This role is perfect for someone who:
Loves systems programming and making things fast
Has war stories about debugging distributed training failures
Gets excited about squeezing 10% more throughput from a GPU cluster
Wants to work on cutting-edge models (DiT, flow matching) in production

What You'll Do

Core Responsibilities

Training Infrastructure (40%)
Build distributed training pipelines for 3B-14B parameter DiT models
Set up multi-node training on AWS A100 GPUs
Optimize training throughput: mixed precision, gradient accumulation, activation checkpointing
Debug NCCL/NVSHMEM issues, handle node failures gracefully
Implement checkpointing, resumption, and fault tolerance

Data Pipeline (30%)
Build scalable video preprocessing pipelines (depth estimation, camera pose, segmentation)
Optimize data loading to prevent GPU starvation (PyTorch DataLoader, FFCV, WebDataset)
Handle hours of 1080p video per model
Implement data augmentation strategies for video
Set up versioned datasets with provenance tracking

Monitoring & Optimization (20%)
Instrument training with Weights & Biases / MLflow
Build real-time dashboards for loss curves, GPU utilization, throughput
Profile training runs, identify bottlenecks
Implement gradient clipping, learning rate scheduling, warmup strategies
Cost optimization: spot instances, preemption handling, cost per model tracking

Experimentation (10%)
Support ML Research Lead with rapid experimentation
Run ablation studies and architecture comparisons
A/B test different training configurations
Document best practices and reproduce successful runs

Requirements

Must Have
3+ years training deep learning models in production
Strong Python and PyTorch (or JAX/TensorFlow with willingness to learn PyTorch)
Distributed training experience: DDP, FSDP, DeepSpeed, or similar
GPU programming knowledge: CUDA awareness, memory management, kernel profiling
Linux systems proficiency: bash scripting, process management, network debugging
Cloud experience: AWS/GCP/Azure, provisioning GPU instances, cost management
Comfort with ambiguity: early-stage startup, things change rapidly

Ideally Have
Trained billion-parameter models (1B+ params)
Experience with video/multimodal models (not just image/text)
Kubernetes for ML workloads (Kubeflow, Kubeflow Pipelines, or custom)
Infrastructure as Code: Terraform, Pulumi, CloudFormation
High-performance dataloaders: FFCV, WebDataset, DALI
Built experiment tracking systems from scratch or heavily customized existing ones

Nice to Have
Experience with model parallelism (tensor parallel, pipeline parallel)
Knowledge of quantization and mixed-precision training
Familiarity with Ray, Dask, or other distributed compute frameworks
Understanding of compiler optimizations (TorchScript, TorchDynamo, XLA)
Open-source contributions to PyTorch, Hugging Face, or ML infra projects

Day-to-Day Examples

Monday: ML Research Lead wants to try a new attention mechanism. You modify the training script, kick off a run on 8 A100s, instrument it with W&B, and have results by EOD.

Tuesday: Training run OOMed at step 12,000. You debug, realize activation checkpointing wasn't applied to a new layer, fix it, and resume from the last checkpoint.

Wednesday: Data loading is the bottleneck (GPUs at 60% utilization). You profile the dataloader, discover the depth estimation preprocessing is slow, move it to GPU, and get utilization to 95%.

Thursday: Spot instance got preempted mid-training. Training resumed automatically from the last checkpoint thanks to your fault-tolerance system. You review logs and confirm no data corruption.

Friday: Finance asks for a training cost breakdown. You pull metrics from your tracking system: $2.1M spent, 180K GPU hours, $11.67/GPU-hour average. Under budget.

Tech Stack

Languages: Python (primary), Bash, YAML, SQL
Frameworks: PyTorch, Hugging Face, DeepSpeed, FSDP
Cloud: AWS (p4d.24xlarge, p5.48xlarge), S3, EBS
Orchestration: Kubernetes, Slurm (optional)
Monitoring: Weights & Biases, Prometheus, Grafana
Data: FFmpeg, OpenCV, PyAV, WebDataset
Infra: Docker, Terraform, GitHub Actions

We're flexible: If you have strong opinions on better tools, we'll listen.

Projects You'll Work On (First 6 Months)

Project 1: First Model Training Pipeline
Set up 8-node A100 cluster on AWS
Build data preprocessing for producer's 50 hours of footage
Train 3B DiT model from scratch
Success: Model trained in

Sr. QA Engineer

18 minutes ago

Remote, France | AppIQ Tech | Full-time

AppIQ Tech is seeking a meticulous and strategic QA Engineer / Sr. QA Engineer to ensure the quality and reliability of our Machine-Learning-driven e-commerce funnel optimisation and digital advertising platform. You will be responsible for defining the testing strategy for high-performance applications that leverage our proprietary Predictive AI solutions. As...

Field Service Engineer with French

40 minutes ago

Remote, France | Industrial Scientific | Full-time

The team at Industrial Scientific is committed to ending death on the job by the year 2050. We hire inquisitive, motivated people, foster an encouraging environment, and we let them do their job. Our team is highly engaged, builds quality solutions and delivers outstanding customer service. Our leaders understand the critical elements of breakthrough...

AI Engineering Apprentice

16 minutes ago

Remote, France | LunarTech | Full-time

AI Engineering Apprentice (18-Month Program)
Location: Remote / Hybrid / On-site
Duration: 18 Months (Fixed-Term Apprenticeship)
Department: Engineering / AI & Data
Compensation: Paid apprenticeship (salary or stipend based on experience)

About the Apprenticeship
We're launching an 18-month AI Engineering Apprenticeship designed to develop the next...

VP Sales

22 minutes ago

Remote, France | Kinaxis | Full-time

About Kinaxis
Elevate your career journey by embracing a new challenge with Kinaxis. We are experts in tech, but it's really our people who give us passion to always seek ways to do things better. As such, we're serious about your career growth and professional development, because People matter at Kinaxis. In 1984, we started out as a team of three engineers...

Enterprise Account Executive

26 minutes ago

Remote, France | Procore | Full-time

We're looking for an Enterprise Account Executive to join Procore's Sales Team. In this role, you'll apply an understanding of Procore's products, sales methodology, processes, and prospecting techniques to acquire new enterprise customers and expand our existing install base. You will be responsible for both acquiring new logos and driving growth with...

Delivery Executive

20 minutes ago

Remote, France | EPAM Systems | Full-time

We are looking for a Delivery Executive (Director level) with reinsurance and/or insurance experience to support our key clients in France. To be successful in this role, you must be able to engage people and clients, work autonomously, and build your own engagement tactics aligned to EPAM's strategy and goals. Are you ready to solve complex problems, drive...

Customer Success Relationship Manager

22 minutes ago

Remote, France | Industrial Scientific | Full-time

Our Mission
At Industrial Scientific, we are committed to ending workplace deaths by 2050. We hire smart, motivated people and give them the tools and support to make a meaningful impact. Our Customer Success Relationship Managers are key to achieving this vision through proactive engagement with customers, ensuring safety, and expanding the use of our...
