Skills Marketplace
Browse 2,562 skills across 122 packs and 30 categories
Distributed Training Expert
Triggers when users need help with distributed ML training, including data parallelism (DDP, FSDP), model parallelism (tensor, pipeline), DeepSpeed ZeRO stages 1-3, Megatron-LM, 3D parallelism, communication backends (NCCL, Gloo), gradient compression, checkpoint strategies, fault tolerance, and elastic training.
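The core of the data parallelism this skill covers (DDP, FSDP) is averaging per-worker gradients with an all-reduce before every optimizer step. A minimal, dependency-free sketch of that averaging — the worker gradient values are made up, and real DDP delegates this collective to NCCL or Gloo:

```python
# Illustrative simulation of the all-reduce at the heart of data-parallel
# training: every worker computes gradients on its own data shard, the
# gradients are averaged across workers, and every worker applies the
# identical averaged update.

def allreduce_mean(worker_grads):
    """Average gradients element-wise across workers (simulated all-reduce)."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Hypothetical gradients from two workers, one data shard each.
grads = [
    [1.0, -2.0, 4.0],   # worker 0
    [3.0, -4.0, 2.0],   # worker 1
]
avg = allreduce_mean(grads)
print(avg)  # [2.0, -3.0, 3.0] — same update on every worker
```

In production this is one `dist.all_reduce` call per gradient bucket; ZeRO and FSDP additionally shard the parameters and optimizer state so no single GPU holds the full model.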
Feature Store Expert
Triggers when users need help with feature store architecture and implementation, including Feast, Tecton, and Hopsworks. Activate for questions about online vs offline feature serving, feature computation pipelines, point-in-time correctness, feature reuse, feature freshness, streaming features, and feature monitoring and drift detection.
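Point-in-time correctness, mentioned above, is the property that a training label may only see feature values recorded at or before the label's timestamp. A toy sketch with integer timestamps — Feast and Tecton implement this join natively; the entity and value names here are invented:

```python
# Illustrative point-in-time join: for each (entity, ts, label) row,
# attach the freshest feature value with feature_ts <= label ts.
# Using a later value would leak the future into training data.

def point_in_time_join(label_rows, feature_rows):
    joined = []
    for entity, label_ts, label in label_rows:
        candidates = [
            (f_ts, value)
            for f_entity, f_ts, value in feature_rows
            if f_entity == entity and f_ts <= label_ts
        ]
        value = max(candidates)[1] if candidates else None
        joined.append((entity, label_ts, value, label))
    return joined

features = [("user_1", 10, 0.2), ("user_1", 20, 0.9)]
labels = [("user_1", 15, 1)]          # label observed at t=15
print(point_in_time_join(labels, features))
# → [('user_1', 15, 0.2, 1)] — the t=20 value is correctly ignored
```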
GPU Infrastructure Expert
Triggers when users need help with GPU infrastructure for ML workloads, including GPU cluster architecture (A100, H100, H200, B200), NVIDIA CUDA ecosystem, multi-GPU training setup, InfiniBand networking, NVLink, GPU memory management, spot instances for training, cloud GPU comparison across AWS, GCP, Azure, Lambda, and CoreWeave, and on-prem vs cloud cost analysis.
Inference Optimization Expert
Triggers when users need help with ML inference optimization, including model quantization (INT8, INT4, GPTQ, AWQ, GGUF), pruning strategies, knowledge distillation, ONNX Runtime, TensorRT, operator fusion, batching strategies, speculative decoding, and KV cache optimization. Activate for questions about reducing model latency, improving throughput, or lowering inference costs.
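To ground the quantization topics listed above, here is a minimal sketch of symmetric INT8 post-training quantization for one weight tensor, flattened to a list. Real toolchains (GPTQ, AWQ, ONNX Runtime) add calibration data and per-channel scales; this shows only the scale/round/clamp/dequantize round trip:

```python
# Symmetric INT8 quantization: map floats in [-max|w|, +max|w|]
# onto integers in [-127, 127] via a single scale factor.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(w)
print(q)                     # [50, -127, 0, 127]
print(dequantize(q, scale))  # close to the original, within one step
```

The accuracy cost comes from that rounding step; techniques like GPTQ choose the rounded values to minimize output error rather than rounding each weight independently.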
ML CI/CD Expert
Triggers when users need help with CI/CD for ML systems, including training pipelines, model validation, and deployment automation. Activate for questions about GitHub Actions or GitLab CI for ML, automated retraining triggers, model validation gates, deployment strategies (blue-green, canary, shadow), infrastructure as code for ML, and environment reproducibility with Docker, conda, and pip-tools.
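A model validation gate, as named above, is the CI step that blocks promotion of a candidate model unless it clears absolute quality floors and does not regress against the production baseline. A hedged sketch — the metric name and thresholds are illustrative, not taken from any specific tool:

```python
# Illustrative CI validation gate: promote only if the candidate clears
# an absolute AUC floor and stays within a regression tolerance of the
# current production baseline.

def validation_gate(candidate, baseline, max_regression=0.01, min_auc=0.75):
    """Return (passed, reasons) for a candidate metrics dict."""
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"AUC {candidate['auc']:.3f} below floor {min_auc}")
    if candidate["auc"] < baseline["auc"] - max_regression:
        reasons.append("regression vs baseline exceeds tolerance")
    return (not reasons, reasons)

passed, why = validation_gate({"auc": 0.81}, {"auc": 0.80})
print(passed)  # True — clears both checks
```

In a pipeline this function would run after evaluation on a held-out set, with a failing gate failing the CI job so the deployment step never executes.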
ML Cost Optimization Expert
Triggers when users need help with ML cost optimization, including compute cost management for training and inference, spot instance strategies, model size vs accuracy tradeoffs, right-sizing GPU instances, caching strategies, batch inference optimization, managed vs self-hosted infrastructure decisions, FinOps for ML teams, and cost attribution and chargeback models.
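The spot-instance tradeoff mentioned above reduces to simple arithmetic: spot is cheaper per hour, but interruptions force re-running work since the last checkpoint. A back-of-envelope sketch — every price and rate below is a made-up illustrative number, not a quote from any cloud:

```python
# Expected training cost including rework after spot interruptions.

def effective_cost(hours, hourly_rate, interrupt_rate=0.0, retry_overhead=0.1):
    """interrupt_rate: expected interruptions per hour.
    retry_overhead: extra hours of rework per interruption
    (roughly the checkpoint interval)."""
    rework_hours = hours * interrupt_rate * retry_overhead
    return (hours + rework_hours) * hourly_rate

on_demand = effective_cost(100, 32.0)   # hypothetical on-demand rate
spot = effective_cost(100, 10.0, interrupt_rate=0.05, retry_overhead=2.0)
print(on_demand, spot)  # 3200.0 1100.0 — spot wins despite rework
```

The same arithmetic shows when spot loses: frequent interruptions combined with sparse checkpointing inflate `rework_hours` until the discount disappears.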
ML Experiment Tracking Expert
Triggers when users need help with ML experiment tracking, including Weights & Biases, MLflow, Neptune, or ClearML setup and configuration. Activate for questions about experiment organization, metric logging, artifact management, hyperparameter sweeps, team collaboration in experiment platforms, and cost tracking across training runs.
ML Monitoring Expert
Triggers when users need help with ML model monitoring in production, including data drift detection (PSI, KL divergence, KS test), concept drift, model performance monitoring, prediction monitoring, alerting strategies, shadow mode deployment, ground truth collection, monitoring dashboards, and SLA management for ML systems.
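Of the drift statistics listed above, PSI (Population Stability Index) is the simplest to show end to end: compare the binned proportions of a reference sample against production traffic. The 0.1/0.25 alert thresholds are common rules of thumb; the bin proportions below are illustrative:

```python
# PSI = sum over bins of (prod - ref) * ln(prod / ref).
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
import math

def psi(ref_fracs, prod_fracs, eps=1e-6):
    total = 0.0
    for r, p in zip(ref_fracs, prod_fracs):
        r, p = max(r, eps), max(p, eps)  # guard empty bins
        total += (p - r) * math.log(p / r)
    return total

stable = psi([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25])
shifted = psi([0.25, 0.25, 0.25, 0.25], [0.05, 0.15, 0.30, 0.50])
print(stable, shifted)  # stable well under 0.1; shifted well over 0.25
```

In practice the reference distribution is frozen from training data, the production distribution is recomputed per monitoring window, and the PSI per feature feeds the alerting rules this skill covers.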
ML Platform Design Expert
Triggers when users need help with internal ML platform architecture and design, including self-serve ML infrastructure, platform team responsibilities, abstraction layers for data scientists, notebook-to-production workflows, multi-tenant ML platforms, platform metrics and adoption, and build vs buy decisions for ML tools.
ML Testing Expert
Triggers when users need help with testing ML systems, including unit testing ML code, integration testing ML pipelines, data validation testing, model quality testing with regression tests and performance thresholds, training pipeline testing, serving endpoint testing, load testing for ML systems, test data management, and property-based testing for data transforms.
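Property-based testing for data transforms, the last item above, means asserting invariants over many generated inputs instead of a few hand-picked cases. Hypothesis is the usual library; this hand-rolled loop keeps the sketch dependency-free, and the min-max scaler under test is hypothetical:

```python
# Property-based-style testing of a min-max scaler: for random inputs,
# assert shape preservation, output range, and order preservation.
import random

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

random.seed(0)
for _ in range(200):
    xs = [random.uniform(-1e6, 1e6) for _ in range(random.randint(1, 50))]
    out = min_max_scale(xs)
    assert len(out) == len(xs)                  # shape preserved
    assert all(0.0 <= v <= 1.0 for v in out)    # range invariant
    # order preserved: scaling is monotonic
    ranks = sorted(range(len(xs)), key=xs.__getitem__)
    assert ranks == sorted(range(len(out)), key=out.__getitem__)
print("all properties held")
```

The payoff is that edge cases a hand-written test would miss (single-element inputs, constant columns) surface automatically; with Hypothesis, failing inputs are additionally shrunk to a minimal counterexample.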
Model Registry Expert
Triggers when users need help with model versioning and registry systems, including MLflow Model Registry, Weights & Biases, and SageMaker Model Registry. Activate for questions about model lifecycle management, staging and production transitions, approval workflows, model metadata and lineage, packaging formats, CI/CD integration, and model governance and compliance.
Model Serving Infrastructure Expert
Triggers when users need help with model serving and deployment, including serving frameworks like TorchServe, Triton Inference Server, TensorFlow Serving, BentoML, or vLLM. Activate for questions about online vs batch vs streaming inference, REST and gRPC APIs, model warm-up, autoscaling, multi-model serving, A/B testing for models, and canary deployments.
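The canary deployments and A/B tests listed above usually sit on a deterministic traffic split: hash the request key so each user consistently hits the same model version, then ramp the canary percentage. A dependency-free sketch — the function and key names are illustrative, and real serving stacks wire this logic into the router or gateway:

```python
# Deterministic canary routing: bucket users 0-99 by hash and send the
# first canary_pct buckets to the candidate model version.
import hashlib

def route(user_id, canary_pct=10):
    """Return 'canary' for a stable canary_pct% slice of users."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_pct else "stable"

routes = [route(f"user_{i}") for i in range(1000)]
share = routes.count("canary") / len(routes)
print(f"canary share: {share:.2%}")  # close to 10%, stable per user
```

Hash-based splitting beats random per-request routing because a user never flip-flops between model versions mid-session, which keeps both user experience and A/B metrics clean.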