Skills Marketplace

Browse 2,562 skills across 122 packs and 30 categories

Showing 1–12 of 12 skills

Distributed Training Expert

125L

Triggers when users need help with distributed ML training, including data parallelism (DDP, FSDP), model parallelism (tensor, pipeline), DeepSpeed ZeRO stages 1-3, Megatron-LM, 3D parallelism, communication backends (NCCL, Gloo), gradient compression, checkpoint strategies, fault tolerance, and elastic training.

Technology & Engineering / MLOps Infrastructure

Feature Store Expert

109L

Triggers when users need help with feature store architecture and implementation, including Feast, Tecton, and Hopsworks. Activate for questions about online vs offline feature serving, feature computation pipelines, point-in-time correctness, feature reuse, feature freshness, streaming features, and feature monitoring and drift detection.

Technology & Engineering / MLOps Infrastructure

GPU Infrastructure Expert

120L

Triggers when users need help with GPU infrastructure for ML workloads, including GPU cluster architecture (A100, H100, H200, B200), NVIDIA CUDA ecosystem, multi-GPU training setup, InfiniBand networking, NVLink, GPU memory management, spot instances for training, cloud GPU comparison across AWS, GCP, Azure, Lambda, and CoreWeave, and on-prem vs cloud cost analysis.

Technology & Engineering / MLOps Infrastructure

Inference Optimization Expert

123L

Triggers when users need help with ML inference optimization, including model quantization (INT8, INT4, GPTQ, AWQ, GGUF), pruning strategies, knowledge distillation, ONNX Runtime, TensorRT, operator fusion, batching strategies, speculative decoding, and KV cache optimization. Activate for questions about reducing model latency, improving throughput, or lowering inference costs.

Technology & Engineering / MLOps Infrastructure

ML CI/CD Expert

140L

Triggers when users need help with CI/CD for ML systems, including training pipelines, model validation, and deployment automation. Activate for questions about GitHub Actions or GitLab CI for ML, automated retraining triggers, model validation gates, deployment strategies (blue-green, canary, shadow), infrastructure as code for ML, and environment reproducibility with Docker, conda, and pip-tools.

Technology & Engineering / MLOps Infrastructure

ML Cost Optimization Expert

120L

Triggers when users need help with ML cost optimization, including compute cost management for training and inference, spot instance strategies, model size vs accuracy tradeoffs, right-sizing GPU instances, caching strategies, batch inference optimization, managed vs self-hosted infrastructure decisions, FinOps for ML teams, and cost attribution and chargeback models.

Technology & Engineering / MLOps Infrastructure

ML Experiment Tracking Expert

102L

Triggers when users need help with ML experiment tracking, including Weights & Biases, MLflow, Neptune, or ClearML setup and configuration. Activate for questions about experiment organization, metric logging, artifact management, hyperparameter sweeps, team collaboration in experiment platforms, and cost tracking across training runs.

Technology & Engineering / MLOps Infrastructure

ML Monitoring Expert

113L

Triggers when users need help with ML model monitoring in production, including data drift detection (PSI, KL divergence, KS test), concept drift, model performance monitoring, prediction monitoring, alerting strategies, shadow mode deployment, ground truth collection, monitoring dashboards, and SLA management for ML systems.

Technology & Engineering / MLOps Infrastructure

ML Platform Design Expert

150L

Triggers when users need help with internal ML platform architecture and design, including self-serve ML infrastructure, platform team responsibilities, abstraction layers for data scientists, notebook-to-production workflows, multi-tenant ML platforms, platform metrics and adoption, and build vs buy decisions for ML tools.

Technology & Engineering / MLOps Infrastructure

ML Testing Expert

121L

Triggers when users need help with testing ML systems, including unit testing ML code, integration testing ML pipelines, data validation testing, model quality testing with regression tests and performance thresholds, training pipeline testing, serving endpoint testing, load testing for ML systems, test data management, and property-based testing for data transforms.

Technology & Engineering / MLOps Infrastructure

Model Registry Expert

126L

Triggers when users need help with model versioning and registry systems, including MLflow Model Registry, Weights & Biases, and SageMaker Model Registry. Activate for questions about model lifecycle management, staging and production transitions, approval workflows, model metadata and lineage, packaging formats, CI/CD integration, and model governance and compliance.

Technology & Engineering / MLOps Infrastructure

Model Serving Infrastructure Expert

118L

Triggers when users need help with model serving and deployment, including serving frameworks like TorchServe, Triton Inference Server, TensorFlow Serving, BentoML, or vLLM. Activate for questions about online vs batch vs streaming inference, REST and gRPC APIs, model warm-up, autoscaling, multi-model serving, A/B testing for models, and canary deployments.

Technology & Engineering / MLOps Infrastructure