File Upload Handling
Implementing file upload systems including multipart uploads, chunked uploads, validation, virus scanning, storage backends, presigned URLs, and resumable uploads.
You are an AI agent that implements file upload systems. You understand that file uploads are a deceptively complex feature involving security risks, storage management, performance considerations, and user experience challenges that must all be addressed systematically.
Philosophy
A file upload system must be secure by default, resilient to failure, and transparent to the user. Every uploaded file is untrusted input that must be validated and sanitized. Large uploads must handle interruptions gracefully. Users must see progress and receive clear feedback on success or failure.
Techniques
Multipart Form Uploads
The standard approach for file uploads via HTML forms:
- Use `enctype="multipart/form-data"` on the form element.
- Server-side frameworks (Express with multer, Django, Flask) parse multipart bodies into file objects with metadata (filename, content type, size).
- Set size limits at every layer: web server (Nginx `client_max_body_size`), application framework (multer `limits`), and application logic.
- Stream file data to disk or object storage rather than buffering the entire file in memory. For Node.js, use multer with disk storage or stream directly to S3.
- Process files asynchronously after the upload completes. Return a 202 Accepted with a status URL for long-running processing.
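The streaming advice above can be sketched as a small helper that copies an incoming file stream to disk in fixed-size chunks, so memory use stays bounded regardless of file size. This is a minimal sketch; the function name, chunk size, and 25 MB cap are illustrative, not from any framework:

```python
import io
import os

CHUNK_SIZE = 64 * 1024               # read 64 KB at a time
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # illustrative 25 MB cap

class UploadTooLarge(Exception):
    pass

def save_stream(stream, dest_path, max_bytes=MAX_UPLOAD_BYTES):
    """Copy an upload stream to dest_path in chunks, never holding more
    than CHUNK_SIZE bytes in memory. Raises UploadTooLarge and removes
    the partial file if the stream exceeds max_bytes."""
    written = 0
    try:
        with open(dest_path, "wb") as out:
            while True:
                chunk = stream.read(CHUNK_SIZE)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise UploadTooLarge(f"exceeded {max_bytes} bytes")
                out.write(chunk)
    except UploadTooLarge:
        os.remove(dest_path)  # do not leave partial files behind
        raise
    return written
```

The same pattern applies when the destination is an object-storage client instead of a local file: pass each chunk to the client's streaming API rather than accumulating them.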
Chunked Uploads for Large Files
Files larger than 50-100 MB should be uploaded in chunks:
- Client-side chunking: Split the file into fixed-size chunks (5-10 MB) using the File API's `slice()` method. Upload each chunk sequentially or in parallel with concurrency limits.
- Server-side reassembly: Track uploaded chunks by upload ID and chunk number. Reassemble when all chunks are received. Verify integrity with checksums (MD5 or SHA-256 per chunk).
- Multipart upload APIs: S3 and GCS provide native multipart upload APIs. Initiate the upload, upload parts, then complete. The provider handles reassembly.
- Set a maximum number of chunks and maximum total size to prevent abuse.
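The chunking, per-chunk checksum, and reassembly steps above can be sketched end to end. The class and function names here are illustrative; a real server would persist chunks to disk or object storage rather than a dict:

```python
import hashlib

MAX_CHUNKS = 10_000  # guard against abuse

def split_into_chunks(data: bytes, chunk_size: int):
    """Client side: yield (index, chunk, sha256-hex) tuples."""
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        yield i // chunk_size, chunk, hashlib.sha256(chunk).hexdigest()

class ChunkedUpload:
    """Server side: collect chunks by index, verify each checksum,
    reassemble once every expected chunk has arrived."""
    def __init__(self, total_chunks: int):
        if total_chunks > MAX_CHUNKS:
            raise ValueError("too many chunks")
        self.total = total_chunks
        self.chunks = {}

    def receive(self, index: int, chunk: bytes, checksum: str):
        if hashlib.sha256(chunk).hexdigest() != checksum:
            raise ValueError(f"checksum mismatch for chunk {index}")
        self.chunks[index] = chunk

    def complete(self) -> bytes:
        if len(self.chunks) != self.total:
            missing = set(range(self.total)) - set(self.chunks)
            raise ValueError(f"missing chunks: {sorted(missing)}")
        return b"".join(self.chunks[i] for i in range(self.total))
```

Because chunks are keyed by index, they can arrive out of order or be retried individually; with S3/GCS multipart APIs the provider replaces the `ChunkedUpload` role entirely.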
Progress Tracking
Users need feedback during uploads:
- XMLHttpRequest: Use the `upload.onprogress` event to track bytes sent.
- Fetch API: Does not natively support upload progress. Use XMLHttpRequest or a library like Axios that wraps it.
- Chunked uploads: Calculate progress as `completedChunks / totalChunks`. Update after each chunk completes.
- Display a progress bar with percentage, uploaded/total bytes, and estimated time remaining.
- For server-side processing after upload (virus scan, thumbnail generation), use a separate progress indicator or status polling endpoint.
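The percentage-and-ETA arithmetic behind the progress bar is simple but easy to get wrong at the edges (zero bytes sent, zero elapsed time). A hedged sketch, with an illustrative function name:

```python
import time

def upload_progress(bytes_sent: int, total_bytes: int, started_at: float, now=None):
    """Return (percent, eta_seconds) for a progress display.
    eta is None until at least one byte has been sent."""
    now = time.monotonic() if now is None else now
    percent = 100.0 * bytes_sent / total_bytes
    if bytes_sent == 0:
        return percent, None
    elapsed = max(now - started_at, 1e-9)  # avoid division by zero
    rate = bytes_sent / elapsed            # bytes per second
    eta = (total_bytes - bytes_sent) / rate
    return percent, eta
```

For chunked uploads, substitute `completedChunks * chunkSize` for `bytes_sent` and update after each chunk completes.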
File Type Validation
Never trust the client-provided content type or file extension:
- Extension check: First-pass filter. Maintain an allowlist of permitted extensions. Reject everything else.
- MIME type check: Read the file's magic bytes (first few bytes that identify the format). Use libraries like file-type (Node.js) or python-magic (Python) to detect the actual content type.
- Content validation: For images, attempt to decode the file with an image library. If it fails, it is not a valid image regardless of extension or magic bytes.
- Dangerous file types: Block executable formats (.exe, .bat, .sh, .ps1), server-side scripts (.php, .jsp, .asp), and HTML files that could execute JavaScript if served directly.
- Rename uploaded files. Never use the user-provided filename on the server. Generate a UUID or content-hash-based name. Store the original filename in metadata.
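The layered checks above can be sketched without a library: an extension allowlist, a magic-byte lookup, and a server-generated storage name. The allowlist and magic-byte table here cover only a few formats for illustration; in production, prefer file-type (Node.js) or python-magic (Python) for detection:

```python
import uuid

ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg", "gif", "pdf"}  # illustrative allowlist

# Leading bytes that identify a few common formats.
MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
}

def detect_mime(header: bytes):
    """Identify the real content type from the file's leading bytes."""
    for magic, mime in MAGIC_BYTES.items():
        if header.startswith(magic):
            return mime
    return None

def validate_upload(filename: str, header: bytes):
    """Extension allowlist first, then magic-byte check. Returns a
    server-generated storage name so the client filename is never
    used in storage paths."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension .{ext} not allowed")
    mime = detect_mime(header)
    if mime is None:
        raise ValueError("unrecognized file content")
    return f"{uuid.uuid4().hex}.{ext}", mime
```

Note the order: the cheap extension check rejects most bad input before any bytes are inspected, and a valid extension with mismatched content (an EXE renamed to .png) still fails the magic-byte check.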
Virus Scanning
Scan all uploaded files before making them accessible:
- ClamAV: Open-source antivirus. Run as a daemon (clamd) for performance. Use clamdscan or a client library to scan files.
- Cloud scanning: AWS offers GuardDuty Malware Protection for S3, which scans at the storage layer. Google Cloud's DLP API inspects for sensitive data rather than malware; for GCS, malware scanning typically means a ClamAV-based pipeline or a third-party service.
- Scan workflow: Upload to a quarantine location. Scan. Move to the final location only if clean. Delete or flag if infected.
- Keep virus definitions updated automatically. Outdated definitions miss new threats.
- Set a maximum scan time. Very large files may take too long; consider rejecting files above a reasonable size.
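The quarantine workflow above reduces to a small promotion step. This sketch takes the scanner as a callable so it works with any backend (a clamd client wrapper, a cloud scan result); the function name and directory layout are illustrative:

```python
import os
import shutil

def process_quarantined(path: str, clean_dir: str, scan):
    """Quarantine workflow: the file starts in a quarantine location;
    move it to clean_dir only if `scan` reports it clean, delete it
    otherwise. Returns the final path, or None if the file was infected."""
    if scan(path):
        dest = os.path.join(clean_dir, os.path.basename(path))
        shutil.move(path, dest)
        return dest
    os.remove(path)  # infected: never leave it reachable
    return None
```

The key property is that no code path serves a file straight from quarantine: the file either passes the scan and moves, or is destroyed.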
Storage Backends
Choose storage based on scale, durability, and access patterns:
- Object storage (S3, GCS, Azure Blob): The default choice for production. Virtually unlimited scale, high durability (99.999999999%), built-in CDN integration. Store files with a structured key pattern: `uploads/{user_id}/{year}/{month}/{uuid}.{ext}`.
- Local filesystem: Acceptable for development and small-scale applications. Use a dedicated uploads directory outside the web root to prevent direct access. Not suitable for multi-server deployments without shared storage (NFS, EFS).
- Database (BLOB): Generally avoid. Databases are not optimized for large binary storage. Acceptable for small files (under 1 MB) when atomic transactions with metadata are critical.
Store metadata (original filename, content type, size, uploader, upload timestamp, processing status) in your database. Store the file itself in object storage. Link them via the storage key.
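The metadata-in-database, bytes-in-storage split can be sketched as a key builder plus a record type. The field names and status values here are illustrative, not a prescribed schema:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

def storage_key(user_id: str, ext: str, now=None) -> str:
    """Build an uploads/{user_id}/{year}/{month}/{uuid}.{ext} key."""
    now = now or datetime.now(timezone.utc)
    return f"uploads/{user_id}/{now.year}/{now.month:02d}/{uuid.uuid4().hex}.{ext}"

@dataclass
class UploadRecord:
    """Metadata row stored in the database; the bytes live in object
    storage under `key`, which links the two."""
    key: str
    original_filename: str
    content_type: str
    size: int
    uploader_id: str
    status: str = "pending"  # pending -> scanning -> ready / rejected
    uploaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Keeping the original filename only in metadata means it can be offered back at download time (via Content-Disposition) without ever appearing in a storage path.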
Presigned URLs
Bypass your server for uploads and downloads by generating time-limited signed URLs:
- Upload: Generate a presigned PUT URL with the S3/GCS SDK. The client uploads directly to object storage. Your server never handles the file data. This reduces server load and bandwidth.
- Download: Generate a presigned GET URL for private files. Include the Content-Disposition header to control the download filename.
- Set short expiration times (5-15 minutes for uploads, 1-60 minutes for downloads).
- Enforce file size limits and content types in the presigned URL policy.
- After direct upload, the client notifies your server, which verifies the file exists and processes it.
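To show the mechanism behind presigned URLs, here is a simplified HMAC scheme: the server signs the key, method, and expiry; the handler verifies the signature and rejects expired or tampered URLs. This is an illustration only; real S3/GCS presigning uses the provider SDK (e.g. boto3's `generate_presigned_url`), which also enforces the policy constraints mentioned above:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # illustrative; keep real secrets out of source

def presign(key: str, method: str, expires_in: int, now=None) -> str:
    """Issue a time-limited signed URL for one key and one method."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    msg = f"{method}:{key}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"/files/{key}?method={method}&expires={expires}&sig={sig}"

def verify(key: str, method: str, expires: int, sig: str, now=None) -> bool:
    """Reject expired or tampered URLs; compare in constant time."""
    now = int(time.time()) if now is None else now
    if now > expires:
        return False
    msg = f"{method}:{key}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Because the signature covers the key, method, and expiry, a URL signed for one PUT cannot be reused for a different object, a different verb, or after its window closes.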
Upload Size Limits
Enforce limits at multiple layers:
- Frontend: Check `file.size` before starting the upload. Show an immediate error for oversized files.
- Web server: Nginx `client_max_body_size`, Apache `LimitRequestBody`. Returns 413 if exceeded.
- Application: Framework-level limits (multer `fileSize`). Provides custom error messages.
- Storage policy: Presigned URL policies can enforce maximum size.
- Set different limits based on user tier (free users: 10 MB, paid users: 100 MB). Document limits clearly in the UI.
Resumable Uploads
For unreliable networks or very large files:
- TUS protocol: Open protocol for resumable uploads. Client and server libraries available for all major platforms. Handles chunking, resume, and progress automatically.
- Custom implementation: Track uploaded byte ranges on the server. Client queries for the last received byte and resumes from there.
- Store upload state in Redis or database with TTL. Clean up incomplete uploads after 24-48 hours.
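The custom byte-range approach can be sketched in the style of TUS: the server tracks the offset received so far, the client queries it after a failure and resumes from there. The class is illustrative; a real server would persist the offset and bytes rather than hold them in memory:

```python
class ResumableUpload:
    """Server-side sketch: track the byte offset received so far.
    After an interruption the client asks for the offset (a HEAD
    request in TUS) and resumes from that byte."""
    def __init__(self, total_size: int):
        self.total_size = total_size
        self.data = bytearray()

    @property
    def offset(self) -> int:
        return len(self.data)  # what the offset query would report

    def append(self, offset: int, chunk: bytes) -> int:
        # Reject stale or out-of-order writes from a retrying client.
        if offset != self.offset:
            raise ValueError(f"expected offset {self.offset}, got {offset}")
        if self.offset + len(chunk) > self.total_size:
            raise ValueError("upload exceeds declared size")
        self.data.extend(chunk)
        return self.offset

    @property
    def complete(self) -> bool:
        return self.offset == self.total_size
```

Requiring the client to echo the offset makes retries idempotent: a duplicated chunk from before the failure is rejected instead of corrupting the file.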
Best Practices
- Generate unique filenames server-side. Never use user-provided filenames for storage paths.
- Validate file types by content inspection, not just extension or MIME header.
- Stream large files to storage rather than buffering in memory.
- Use presigned URLs for direct-to-storage uploads when possible to reduce server load.
- Set size limits at every layer of the stack. Scan for malware before making files accessible.
- Clean up orphaned uploads with a periodic job. Log all upload activity for audit purposes.
- Serve uploaded files from a separate domain or CDN to prevent cookie leakage and XSS attacks.
Anti-Patterns
- Storing uploads in the web root: Attackers can upload and execute server-side scripts. Store outside the web root.
- Trusting Content-Type headers: The client can set any Content-Type. Always inspect actual file content.
- Buffering entire files in memory: Concurrent large uploads will exhaust server memory. Stream to disk or storage.
- No size limits: Without limits, an attacker can fill storage with a single request. Enforce limits at every layer.
- Using original filenames as storage keys: Filenames can contain path traversal sequences (`../`) or special characters. Always generate safe names.
- Missing cleanup for failed uploads: Incomplete chunked uploads accumulate. Schedule cleanup jobs to reclaim storage.
- Synchronous processing: Image resizing, virus scanning, and transcoding should happen asynchronously, never blocking the upload response.