Security Monitoring and Detection
Build the detection layer that catches attacks in production — log
Prevention is the first line; detection is the second. SAST, signed builds, policy enforcement, hardened defaults — these reduce the attack surface but cannot eliminate it. Detection is the layer that fires when an attacker gets through prevention. Without detection, attacks can dwell for months or years.
Detection is a discipline distinct from prevention. Different tools, different time horizons, different metrics. The team running detection thinks in terms of dwell time, signal quality, and alert fatigue.
What to Log
Log enough to investigate any incident, but not so much that the cost dominates and the signal drowns. The standard log set:
- Authentication events — every login, every failed login, every credential change.
- Authorization decisions — every access denial, every privilege escalation, every cross-tenant request.
- Administrative actions — every change to roles, policies, security groups, IAM. Every infrastructure mutation.
- Data access — reads of sensitive tables, exports, downloads, cross-region transfers.
- Network traffic — flow logs at trust boundaries; payload inspection where regulations require.
- Application errors — particularly auth errors, validation failures, anomalous responses.
For each logged event, capture three categories of data:
- Who — user, service account, IP, session.
- What — the action and the target.
- Context — timestamp, source, resource ID, before/after state where applicable.
The log format is structured (JSON, ECS, OCSF). Free-text logs are unsearchable at scale.
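As a sketch of what a structured event can look like (field names here are illustrative, not ECS or OCSF; map them onto whichever schema you adopt):

```python
import json
from datetime import datetime, timezone

# Illustrative auth-failure event covering the who/what/context split above.
# Field names are made up for the example, not a standard schema.
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "category": "authentication",
    "action": "login",
    "outcome": "failure",
    "who": {"user_id": "u-4821", "source_ip": "203.0.113.7", "session_id": "s-99f3"},
    "what": {"target": "payments-api", "resource_id": "db/payments/customers"},
    "context": {"source": "auth-service", "reason": "invalid_password"},
}

print(json.dumps(event))  # one JSON object per line, ready for central ingestion
```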
Where to Send Logs
Centralize. Logs scattered across services, regions, and platforms cannot support detection. The detection team needs one queryable surface.
Pick a SIEM (Splunk, Elastic, Sumo Logic, Sentinel) or a query-engine model (Snowflake, BigQuery, ClickHouse) that fits your team. Splunk is the industry standard; the query-engine pattern is increasingly common because its cost model scales better.
Set retention based on threat model:
- Hot (immediate query, fast): 30–90 days. Enough for active investigation.
- Warm (queryable but slower): 6–12 months. Enough for incident timeline reconstruction.
- Cold (archived, slow to retrieve): 1–7 years. Enough for compliance and long-tail forensics.
Costs scale with retention; tier accordingly.
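One way to keep the policy explicit is to encode the tiers as data; the numbers below simply mirror the ranges above, and enforcement still lives in your log platform's lifecycle rules:

```python
# Hedged sketch: retention tiers as reviewable data, not a specific platform's config.
RETENTION_TIERS = {
    "hot":  {"max_age_days": 90,   "use": "active investigation"},
    "warm": {"max_age_days": 365,  "use": "incident timeline reconstruction"},
    "cold": {"max_age_days": 2555, "use": "compliance and long-tail forensics"},
}

def tier_for(age_days: int) -> str:
    """Return the tier an event of a given age should live in."""
    for name, spec in RETENTION_TIERS.items():  # dicts preserve insertion order
        if age_days <= spec["max_age_days"]:
            return name
    return "expired"
```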
Detection Rules
A detection rule is a query that, when matched, generates an alert. A simple rule:
```
auth_failure
WHERE source_ip NOT IN known_office_ips
GROUP BY user_id, source_ip
HAVING count > 10 within 5min
```
Alert: 10+ failed logins from an unknown IP, suggesting credential stuffing.
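In code form, the same rule might look roughly like this. It's a sketch assuming events arrive as dicts in time order, with field names matching the structured-event example above:

```python
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(minutes=5)
THRESHOLD = 10

def credential_stuffing_alerts(events, known_office_ips):
    """Sketch of the rule above: alert when a (user, IP) pair outside the known
    office ranges accumulates THRESHOLD auth failures inside WINDOW.
    `events` are dicts with 'type', 'user_id', 'source_ip', 'ts' (datetime)."""
    recent = defaultdict(list)
    alerts = []
    for e in events:
        if e["type"] != "auth_failure" or e["source_ip"] in known_office_ips:
            continue
        key = (e["user_id"], e["source_ip"])
        # keep only timestamps still inside the sliding window, then add this one
        recent[key] = [t for t in recent[key] if e["ts"] - t <= WINDOW]
        recent[key].append(e["ts"])
        if len(recent[key]) >= THRESHOLD:
            alerts.append({"rule": "credential_stuffing", "user_id": e["user_id"],
                           "source_ip": e["source_ip"], "at": e["ts"]})
    return alerts
```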
A modern detection rule set has hundreds of rules across many categories:
- Brute force — failed logins, failed API auth.
- Credential abuse — known-leaked credential usage, impossible travel, anomalous session lifetimes.
- Privilege escalation — unexpected role grants, admin actions outside normal hours.
- Data exfiltration — unusual data export volume, sensitive data accessed at scale.
- Lateral movement — services calling services they don't normally call.
- Persistence — new accounts created, new keys provisioned, new scheduled tasks.
- Defense evasion — log tampering, monitoring tool disabling, security group changes.
The rule set should be version-controlled, peer-reviewed, and tested. Detection rules are code; treat them as code.
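In practice that means each rule ships with fixtures that must and must not match, and CI runs them on every change. A toy example, assuming the brute-force sketch above lives in a hypothetical rules module:

```python
from datetime import datetime, timedelta
from rules.credential_stuffing import credential_stuffing_alerts  # hypothetical module path

def burst(n, ip, start=datetime(2024, 1, 1)):
    """n auth failures from one user and IP, ten seconds apart."""
    return [{"type": "auth_failure", "user_id": "u-1", "source_ip": ip,
             "ts": start + timedelta(seconds=10 * i)} for i in range(n)]

def test_fires_on_burst_from_unknown_ip():
    assert credential_stuffing_alerts(burst(12, "198.51.100.9"), known_office_ips=set())

def test_stays_silent_for_office_ip():
    assert not credential_stuffing_alerts(burst(12, "10.0.0.5"), known_office_ips={"10.0.0.5"})
```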
Detection Engineering
Detection engineering is the discipline of building, testing, and tuning detections. It has a few core practices:
- Adversary emulation. Run controlled simulations of attack techniques (red team, atomic tests, breach-and-attack-simulation tools). For each technique, verify your detections fire. The ones that don't fire are gaps; build the detection.
- Detection-as-code. Rules are version controlled. Each rule has tests (sample events that should match, sample events that should not). The CI for the detection repo runs tests on every change.
- MITRE ATT&CK mapping. Each detection is tagged with the ATT&CK technique it covers. The coverage is visible in a heat map; gaps are obvious.
- False-positive tuning. Each rule has a tuning history. The team reviews the rules that fire too often and refines them.
The metrics:
- Coverage — what percentage of relevant ATT&CK techniques have detections.
- Mean time to detect (MTTD) — from attacker action to alert firing.
- False-positive rate — alerts dismissed as not malicious.
- Detection-to-investigation ratio — the fraction of alerts that get investigated. Less than 1.0 means alerts are being dropped.
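These roll up from closed alert records. A minimal computation, assuming each record carries a few illustrative fields:

```python
from statistics import mean

def detection_metrics(alerts):
    """Toy rollup. Each alert dict is assumed to carry 'false_positive' and
    'investigated' booleans, plus 'attack_start'/'alert_at' datetimes for true
    positives; field names are illustrative, not a standard schema."""
    true_pos = [a for a in alerts if not a["false_positive"]]
    return {
        "mttd_seconds": mean((a["alert_at"] - a["attack_start"]).total_seconds()
                             for a in true_pos) if true_pos else None,
        "false_positive_rate": 1 - len(true_pos) / len(alerts),
        "detection_to_investigation": sum(a["investigated"] for a in alerts) / len(alerts),
    }
```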
Alert Triage
Alerts arrive faster than humans can read them. The triage system:
- Auto-suppress known-benign. Some alerts (a deploy from a known engineer at a known time) are noise; suppress automatically.
- Auto-enrich. The alert arrives with context already populated — user history, IP reputation, related events. The analyst doesn't have to fetch.
- Tier by severity. Critical alerts get human eyes immediately. High alerts within an hour. Medium alerts within a day. Low alerts in batch review.
- Auto-correlate. Multiple related alerts collapse into a single incident. The analyst investigates one incident, not five alerts.
A good triage pipeline reduces 1,000 raw alerts to 10–20 incidents that humans investigate. The compression ratio is the metric; tune for it.
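A skeleton of that pipeline, with the suppression list, enrichment lookup, and severity tiers as stand-ins for whatever your stack provides:

```python
from collections import defaultdict

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}
SUPPRESS = {("deploy_detected", "ci-bot")}  # known-benign (rule, actor) pairs, illustrative

def triage(alerts, ip_reputation):
    """Suppress, enrich, correlate. `alerts` are dicts with 'rule', 'user_id',
    'source_ip', 'severity'; `ip_reputation` is any lookup table."""
    grouped = defaultdict(list)
    for a in alerts:
        if (a["rule"], a["user_id"]) in SUPPRESS:
            continue                                            # auto-suppress known-benign
        a["ip_reputation"] = ip_reputation.get(a["source_ip"])  # auto-enrich
        grouped[(a["user_id"], a["source_ip"])].append(a)       # auto-correlate by entity
    incidents = []
    for related in grouped.values():
        worst = max(related, key=lambda a: SEVERITY_RANK[a["severity"]])
        incidents.append({"severity": worst["severity"], "alerts": related})
    return incidents  # route to humans by severity tier downstream
```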
The Investigation
When an alert is real, the investigation:
- Confirm. Is this an attack or a noisy alert?
- Scope. What systems and accounts are affected? What's the timeline?
- Contain. Stop the attack from spreading. Disable the account; isolate the host; revoke the token.
- Eradicate. Remove the attacker's persistence (backdoors, persistent credentials, scheduled tasks).
- Recover. Restore systems to a known-good state.
- Lessons. Postmortem. What detection caught this? What detection should have caught it earlier?
The investigation is supported by the centralized logs. The analyst pivots from the initial alert across users, IPs, sessions, and resources to reconstruct the attack.
Threat Intelligence Integration
External threat intelligence feeds — IP reputation, malware signatures, leaked credential lists — augment detection. When a known-bad IP appears in your access logs, the alert fires automatically.
Integration patterns:
- Block lists — at network edges, deny known-bad IPs. Reduces noise.
- Watch lists — log access from known-bad IPs as alerts even if not blocked. Catches attacks where the IP isn't blocklisted yet.
- Indicator enrichment — attach reputation to log events. The analyst sees "this IP is on the X feed" alongside the event.
Pick threat-intel sources carefully. Free feeds are noisy; commercial ones are expensive. Curate by what's relevant to your environment.
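The indicator-enrichment pattern above can be as small as a lookup attached at ingest time; the feed names and fields below are invented for the sketch:

```python
# Watch-list style enrichment: annotate rather than block, so the analyst sees
# "this IP is on feed X" next to the event. BAD_IPS stands in for a real feed.
BAD_IPS = {"192.0.2.44": "commercial-feed-a", "198.51.100.23": "open-abuse-list"}

def enrich_with_intel(event):
    feed = BAD_IPS.get(event.get("source_ip"))
    if feed:
        event["threat_intel"] = {"listed_on": feed, "disposition": "watch"}
    return event
```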
The Cost Conversation
Detection is expensive. Storage, SIEM licenses, analyst headcount, threat-intel subscriptions. The cost grows with log volume.
Cost optimization:
- Tier retention. Hot, warm, cold. Most queries hit hot; long-tail forensics use cold.
- Sample non-critical logs. A 1% sample of access logs from low-risk services is enough for trends.
- Move to query-engine pricing. SIEMs charge per ingested GB; query engines charge per GB scanned at query time. For low-query workloads, the latter wins.
Cost should be visible. Each team that generates logs sees the cost; the security team makes the case for the cost.
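The sampling lever above can be a one-liner at the log shipper: hash the event ID and keep a deterministic slice, so reruns and joins always see the same events.

```python
import hashlib

def keep_sample(event_id: str, rate: float = 0.01) -> bool:
    """Deterministic ~1% sample of low-risk access logs. Hash-based so the same
    event is always kept or dropped, regardless of which shipper sees it."""
    bucket = int(hashlib.sha256(event_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```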
Anti-Patterns
Logs without detection. Storing logs but not running queries against them. The logs are useless if no one looks at them.
Detection without tests. A rule update breaks; alerts stop firing; nobody notices for weeks. Test rules.
Alert fatigue. Hundreds of alerts a day; analysts dismiss them all; real attacks get missed. Tune; reduce; auto-correlate.
Single SIEM as point of failure. All logs in one place; the SIEM goes down; the team is blind. Have a backup query path.
Threat intel as the only signal. Feeds catch known-bad; the unknown-bad slips through. Combine with behavioral detection.
No coverage measurement. The team doesn't know which ATT&CK techniques are uncovered. Build a coverage dashboard.