Skills Marketplace
Browse 2,562 skills across 122 packs and 30 categories
Distributed Training Expert
Triggers when users need help with distributed ML training, including data parallelism (DDP, FSDP), model parallelism (tensor, pipeline), DeepSpeed ZeRO stages 1-3, Megatron-LM, 3D parallelism, communication backends (NCCL, Gloo), gradient compression, checkpoint strategies, fault tolerance, and elastic training.
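The core of the data parallelism this skill covers (DDP, FSDP) is averaging per-worker gradients with an all-reduce before every optimizer step. A minimal, dependency-free sketch of that averaging — the worker gradient values are made up, and real DDP delegates this collective to NCCL or Gloo:

```python
# Illustrative simulation of the all-reduce at the heart of data-parallel
# training: every worker computes gradients on its own data shard, the
# gradients are averaged across workers, and every worker applies the
# identical averaged update.

def allreduce_mean(worker_grads):
    """Average gradients element-wise across workers (simulated all-reduce)."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    return [
        sum(g[i] for g in worker_grads) / n_workers
        for i in range(n_params)
    ]

# Hypothetical gradients from two workers, one data shard each.
grads = [
    [1.0, -2.0, 4.0],   # worker 0
    [3.0, -4.0, 2.0],   # worker 1
]
avg = allreduce_mean(grads)
print(avg)  # [2.0, -3.0, 3.0] — same update on every worker
```

In production this is one `dist.all_reduce` call per gradient bucket; ZeRO and FSDP additionally shard the parameters and optimizer state so no single GPU holds the full model.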
Feature Store Expert
Triggers when users need help with feature store architecture and implementation, including Feast, Tecton, and Hopsworks. Activate for questions about online vs offline feature serving, feature computation pipelines, point-in-time correctness, feature reuse, feature freshness, streaming features, and feature monitoring and drift detection.
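Point-in-time correctness, mentioned above, is the property that a training label may only see feature values recorded at or before the label's timestamp. A toy sketch with integer timestamps — Feast and Tecton implement this join natively; the entity and value names here are invented:

```python
# Illustrative point-in-time join: for each (entity, ts, label) row,
# attach the freshest feature value with feature_ts <= label ts.
# Using a later value would leak the future into training data.

def point_in_time_join(label_rows, feature_rows):
    joined = []
    for entity, label_ts, label in label_rows:
        candidates = [
            (f_ts, value)
            for f_entity, f_ts, value in feature_rows
            if f_entity == entity and f_ts <= label_ts
        ]
        value = max(candidates)[1] if candidates else None
        joined.append((entity, label_ts, value, label))
    return joined

features = [("user_1", 10, 0.2), ("user_1", 20, 0.9)]
labels = [("user_1", 15, 1)]          # label observed at t=15
print(point_in_time_join(labels, features))
# → [('user_1', 15, 0.2, 1)] — the t=20 value is correctly ignored
```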
GPU Infrastructure Expert
Triggers when users need help with GPU infrastructure for ML workloads, including GPU cluster architecture (A100, H100, H200, B200), NVIDIA CUDA ecosystem, multi-GPU training setup, InfiniBand networking, NVLink, GPU memory management, spot instances for training, cloud GPU comparison across AWS, GCP, Azure, Lambda, and CoreWeave, and on-prem vs cloud cost analysis.
Inference Optimization Expert
Triggers when users need help with ML inference optimization, including model quantization (INT8, INT4, GPTQ, AWQ, GGUF), pruning strategies, knowledge distillation, ONNX Runtime, TensorRT, operator fusion, batching strategies, speculative decoding, and KV cache optimization. Activate for questions about reducing model latency, improving throughput, or lowering inference costs.
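To ground the quantization topics listed above, here is a minimal sketch of symmetric INT8 post-training quantization for one weight tensor, flattened to a list. Real toolchains (GPTQ, AWQ, ONNX Runtime) add calibration data and per-channel scales; this shows only the scale/round/clamp/dequantize round trip:

```python
# Symmetric INT8 quantization: map floats in [-max|w|, +max|w|]
# onto integers in [-127, 127] via a single scale factor.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(w)
print(q)                     # [50, -127, 0, 127]
print(dequantize(q, scale))  # close to the original, within one step
```

The accuracy cost comes from that rounding step; techniques like GPTQ choose the rounded values to minimize output error rather than rounding each weight independently.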
ML CI/CD Expert
Triggers when users need help with CI/CD for ML systems, including training pipelines, model validation, and deployment automation. Activate for questions about GitHub Actions or GitLab CI for ML, automated retraining triggers, model validation gates, deployment strategies (blue-green, canary, shadow), infrastructure as code for ML, and environment reproducibility with Docker, conda, and pip-tools.
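A model validation gate, as named above, is the CI step that blocks promotion of a candidate model unless it clears absolute quality floors and does not regress against the production baseline. A hedged sketch — the metric name and thresholds are illustrative, not taken from any specific tool:

```python
# Illustrative CI validation gate: promote only if the candidate clears
# an absolute AUC floor and stays within a regression tolerance of the
# current production baseline.

def validation_gate(candidate, baseline, max_regression=0.01, min_auc=0.75):
    """Return (passed, reasons) for a candidate metrics dict."""
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"AUC {candidate['auc']:.3f} below floor {min_auc}")
    if candidate["auc"] < baseline["auc"] - max_regression:
        reasons.append("regression vs baseline exceeds tolerance")
    return (not reasons, reasons)

passed, why = validation_gate({"auc": 0.81}, {"auc": 0.80})
print(passed)  # True — clears both checks
```

In a pipeline this function would run after evaluation on a held-out set, with a failing gate failing the CI job so the deployment step never executes.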
ML Cost Optimization Expert
Triggers when users need help with ML cost optimization, including compute cost management for training and inference, spot instance strategies, model size vs accuracy tradeoffs, right-sizing GPU instances, caching strategies, batch inference optimization, managed vs self-hosted infrastructure decisions, FinOps for ML teams, and cost attribution and chargeback models.
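The spot-instance tradeoff mentioned above reduces to simple arithmetic: spot is cheaper per hour, but interruptions force re-running work since the last checkpoint. A back-of-envelope sketch — every price and rate below is a made-up illustrative number, not a quote from any cloud:

```python
# Expected training cost including rework after spot interruptions.

def effective_cost(hours, hourly_rate, interrupt_rate=0.0, retry_overhead=0.1):
    """interrupt_rate: expected interruptions per hour.
    retry_overhead: extra hours of rework per interruption
    (roughly the checkpoint interval)."""
    rework_hours = hours * interrupt_rate * retry_overhead
    return (hours + rework_hours) * hourly_rate

on_demand = effective_cost(100, 32.0)   # hypothetical on-demand rate
spot = effective_cost(100, 10.0, interrupt_rate=0.05, retry_overhead=2.0)
print(on_demand, spot)  # 3200.0 1100.0 — spot wins despite rework
```

The same arithmetic shows when spot loses: frequent interruptions combined with sparse checkpointing inflate `rework_hours` until the discount disappears.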
ML Experiment Tracking Expert
Triggers when users need help with ML experiment tracking, including Weights & Biases, MLflow, Neptune, or ClearML setup and configuration. Activate for questions about experiment organization, metric logging, artifact management, hyperparameter sweeps, team collaboration in experiment platforms, and cost tracking across training runs.
ML Monitoring Expert
Triggers when users need help with ML model monitoring in production, including data drift detection (PSI, KL divergence, KS test), concept drift, model performance monitoring, prediction monitoring, alerting strategies, shadow mode deployment, ground truth collection, monitoring dashboards, and SLA management for ML systems.
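Of the drift statistics listed above, PSI (Population Stability Index) is the simplest to show end to end: compare the binned proportions of a reference sample against production traffic. The 0.1/0.25 alert thresholds are common rules of thumb; the bin proportions below are illustrative:

```python
# PSI = sum over bins of (prod - ref) * ln(prod / ref).
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
import math

def psi(ref_fracs, prod_fracs, eps=1e-6):
    total = 0.0
    for r, p in zip(ref_fracs, prod_fracs):
        r, p = max(r, eps), max(p, eps)  # guard empty bins
        total += (p - r) * math.log(p / r)
    return total

stable = psi([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25])
shifted = psi([0.25, 0.25, 0.25, 0.25], [0.05, 0.15, 0.30, 0.50])
print(stable, shifted)  # stable well under 0.1; shifted well over 0.25
```

In practice the reference distribution is frozen from training data, the production distribution is recomputed per monitoring window, and the PSI per feature feeds the alerting rules this skill covers.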
ML Platform Design Expert
Triggers when users need help with internal ML platform architecture and design, including self-serve ML infrastructure, platform team responsibilities, abstraction layers for data scientists, notebook-to-production workflows, multi-tenant ML platforms, platform metrics and adoption, and build vs buy decisions for ML tools.
ML Testing Expert
Triggers when users need help with testing ML systems, including unit testing ML code, integration testing ML pipelines, data validation testing, model quality testing with regression tests and performance thresholds, training pipeline testing, serving endpoint testing, load testing for ML systems, test data management, and property-based testing for data transforms.
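Property-based testing for data transforms, the last item above, means asserting invariants over many generated inputs instead of a few hand-picked cases. Hypothesis is the usual library; this hand-rolled loop keeps the sketch dependency-free, and the min-max scaler under test is hypothetical:

```python
# Property-based-style testing of a min-max scaler: for random inputs,
# assert shape preservation, output range, and order preservation.
import random

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

random.seed(0)
for _ in range(200):
    xs = [random.uniform(-1e6, 1e6) for _ in range(random.randint(1, 50))]
    out = min_max_scale(xs)
    assert len(out) == len(xs)                  # shape preserved
    assert all(0.0 <= v <= 1.0 for v in out)    # range invariant
    # order preserved: scaling is monotonic
    ranks = sorted(range(len(xs)), key=xs.__getitem__)
    assert ranks == sorted(range(len(out)), key=out.__getitem__)
print("all properties held")
```

The payoff is that edge cases a hand-written test would miss (single-element inputs, constant columns) surface automatically; with Hypothesis, failing inputs are additionally shrunk to a minimal counterexample.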
Model Registry Expert
Triggers when users need help with model versioning and registry systems, including MLflow Model Registry, Weights & Biases, and SageMaker Model Registry. Activate for questions about model lifecycle management, staging and production transitions, approval workflows, model metadata and lineage, packaging formats, CI/CD integration, and model governance and compliance.
Model Serving Infrastructure Expert
Triggers when users need help with model serving and deployment, including serving frameworks like TorchServe, Triton Inference Server, TensorFlow Serving, BentoML, or vLLM. Activate for questions about online vs batch vs streaming inference, REST and gRPC APIs, model warm-up, autoscaling, multi-model serving, A/B testing for models, and canary deployments.
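The canary deployments and A/B tests listed above usually sit on a deterministic traffic split: hash the request key so each user consistently hits the same model version, then ramp the canary percentage. A dependency-free sketch — the function and key names are illustrative, and real serving stacks wire this logic into the router or gateway:

```python
# Deterministic canary routing: bucket users 0-99 by hash and send the
# first canary_pct buckets to the candidate model version.
import hashlib

def route(user_id, canary_pct=10):
    """Return 'canary' for a stable canary_pct% slice of users."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_pct else "stable"

routes = [route(f"user_{i}") for i in range(1000)]
share = routes.count("canary") / len(routes)
print(f"canary share: {share:.2%}")  # close to 10%, stable per user
```

Hash-based splitting beats random per-request routing because a user never flip-flops between model versions mid-session, which keeps both user experience and A/B metrics clean.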