Training our models well often requires data from our clients, and these files can be very large, ranging from 5 GB to several terabytes. To transfer such large files safely and efficiently, we built a system that uploads them in smaller chunks, which keeps the upload process smooth from start to finish.
1. Chunked Upload for Improved Reliability and Resumability
Rather than uploading the entire file in a single request, which risks connection timeouts and data loss, we divide the file into chunks of roughly 5 MB each. Smaller chunks make the transfer more stable, and they also make the upload resumable: if the connection drops or an error occurs mid-transfer, the upload simply continues from the last completed chunk rather than starting over from the beginning.
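As a rough sketch of the idea (the exact chunk size and the `planChunks` helper are illustrative, not our production code), splitting a file into fixed-size chunks amounts to computing byte ranges:

```typescript
// Illustrative sketch: plan ~5 MB chunks over a file of a given size.
// CHUNK_SIZE and the Chunk shape are assumptions for this example.
const CHUNK_SIZE = 5 * 1024 * 1024; // ~5 MB per chunk

interface Chunk {
  index: number; // position of the chunk within the file
  start: number; // inclusive byte offset
  end: number;   // exclusive byte offset
}

function planChunks(fileSize: number, chunkSize: number = CHUNK_SIZE): Chunk[] {
  const chunks: Chunk[] = [];
  for (let start = 0, index = 0; start < fileSize; start += chunkSize, index++) {
    chunks.push({ index, start, end: Math.min(start + chunkSize, fileSize) });
  }
  return chunks;
}

// In the browser, each planned chunk maps to file.slice(chunk.start, chunk.end).
```

Because each chunk is identified by its index and byte range, a resumed upload only needs to know which indices were already confirmed.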
2. Secure Transfer Using Signed S3 URLs
To ensure that only authorized parties can write to our storage, each chunk is transferred to the S3 bucket through a signed S3 URL. The URL carries a unique signature that authenticates the request and scopes it to a specific chunk in the designated bucket, which protects both the confidentiality and the integrity of the uploaded files.
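The flow for one chunk can be sketched as follows. The two dependencies are injected so the logic is testable in isolation; in production they would be a call to our backend (to mint the signed URL) and a `fetch` PUT to S3. The function names and the `etag` return shape are assumptions for this sketch, not our actual API.

```typescript
// Sketch: upload one chunk through a presigned S3 URL.
// getSignedUrl and putToUrl are injected stand-ins for real network calls.
type GetSignedUrl = (key: string, index: number) => Promise<string>;
type PutToUrl = (url: string, body: Uint8Array) => Promise<{ ok: boolean; etag: string }>;

async function uploadChunk(
  key: string,
  index: number,
  body: Uint8Array,
  getSignedUrl: GetSignedUrl,
  putToUrl: PutToUrl,
): Promise<string> {
  // 1. Ask the backend for a short-lived URL scoped to this one chunk.
  const url = await getSignedUrl(key, index);
  // 2. PUT the chunk bytes directly to S3; only a correctly signed request is accepted.
  const res = await putToUrl(url, body);
  if (!res.ok) throw new Error(`chunk ${index} upload failed`);
  return res.etag; // S3 returns an identifier per uploaded part
}
```

Uploading directly to S3 with a signed URL also means the chunk bytes never pass through our application servers, which keeps the upload path fast.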
3. Error Recovery System to Minimize Data Loss
Despite these measures, issues can still arise during a long transfer. To minimize the potential for data loss, we run an error recovery pass at the end of each upload session: a cleanup step identifies any chunks that were lost along the way and recovers them, so the assembled file is complete and all uploaded data is preserved.
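The core of such a cleanup pass is a reconciliation step: compare the chunks we expected against the chunks that were actually confirmed, and re-upload the difference. This is a minimal sketch with illustrative names, not our exact implementation:

```typescript
// Sketch of the end-of-session cleanup: find chunk indices that were
// never confirmed, so the client can re-upload just those chunks.
function findMissingChunks(totalChunks: number, confirmed: Set<number>): number[] {
  const missing: number[] = [];
  for (let i = 0; i < totalChunks; i++) {
    if (!confirmed.has(i)) missing.push(i);
  }
  return missing;
}
```

Because each chunk is independent, re-uploading the missing ones is cheap compared with restarting a multi-gigabyte transfer.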
4. User-Friendly Experience with Automatic Resuming and Local Storage
We understand that large file uploads can be time-consuming, potentially taking a full night or more to complete, so we have added several features to make the process seamless. Rather than requiring the user to manually restart the upload after an interruption, it resumes automatically. We also store the file in a local database on the user's computer, so the upload can continue from where it left off even after a page refresh, a browser restart, or a reboot of the machine.
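Conceptually, resuming comes down to persisting which chunks are done and asking for the first unfinished one on the next visit. In the browser the state lives in a local database; this sketch uses an in-memory Map as a stand-in, and all names are illustrative:

```typescript
// Sketch of resumable upload state. In the browser, the file and this
// progress record would live in a local database; a Map stands in here.
interface UploadState {
  fileSize: number;
  chunkSize: number;
  completed: number[]; // indices of chunks already confirmed
}

const store = new Map<string, UploadState>();

function saveProgress(uploadId: string, state: UploadState): void {
  store.set(uploadId, state);
}

// After a refresh or restart, resume from the first unfinished chunk.
function nextChunkIndex(uploadId: string): number {
  const state = store.get(uploadId);
  if (!state) return 0; // no saved session: start from the beginning
  const done = new Set(state.completed);
  const total = Math.ceil(state.fileSize / state.chunkSize);
  for (let i = 0; i < total; i++) {
    if (!done.has(i)) return i;
  }
  return total; // every chunk already uploaded
}
```

Saving progress after every confirmed chunk bounds the rework after any interruption to at most one chunk.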
In summary, our large file upload system is designed to give our users a secure and reliable experience while handling very large files efficiently and accurately. By combining chunked uploads, secure transfer through signed S3 URLs, an error recovery system, and automatic resuming backed by local storage, we minimize the potential for issues and ensure a successful upload even for the largest files.