Seenit Tech Report Q3 2018
At Seenit we take the reliability of our platform very seriously. We use the full range of tools offered by the Google Cloud Platform and Couchbase to allow us to develop and operate our platform with no disruption to our customers and collaborators.
The philosophy we have used is simple: keep the complexity down, leverage Google Cloud features where possible, rely on multiple levels of redundancy, and automate everything. Everything is monitored by the best-in-class monitoring provided by Datadog. All of this means we can spend our time focusing on improvements rather than fighting fires.
To achieve our desired platform availability, Seenit utilises the following:
Multi Data Centre Deployment
By using multiple data centres, we can protect ourselves from issues affecting any one data centre.
Autoscaling Node pools
Each external-facing application is run as an autoscaling node pool. This means that we have a pool of web servers, all accessed via a Google Load Balancer, that expands and contracts with the load on the platform. Any issue with a web server will be detected by the load balancer, and that server will be removed from the pool so as not to impact users.
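The health-check behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Google Load Balancer code: the names Backend, LoadBalancer, probe and pick are hypothetical, and the round-robin selection is an assumption made to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Backend:
    """A web server in the pool (hypothetical model)."""
    name: str
    healthy: bool = True


class LoadBalancer:
    """Toy balancer: probes backends and only routes to healthy ones."""

    def __init__(self, backends: Iterable[Backend],
                 probe: Callable[[Backend], bool]) -> None:
        self.backends: List[Backend] = list(backends)
        self.probe = probe          # assumed health probe, returns True if OK
        self._next = 0              # round-robin counter

    def run_health_checks(self) -> None:
        # Mark each backend healthy or unhealthy from the probe result.
        for backend in self.backends:
            backend.healthy = self.probe(backend)

    def pick(self) -> Backend:
        # Unhealthy backends are skipped, so users never reach them.
        serving = [b for b in self.backends if b.healthy]
        if not serving:
            raise RuntimeError("no healthy backends available")
        chosen = serving[self._next % len(serving)]
        self._next += 1
        return chosen
```

With three servers and a probe that fails for one of them, every call to pick() after run_health_checks() returns one of the two remaining servers.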
Autoscaling Kubernetes for the backend processing
We love tools that make our lives easier, and Kubernetes has improved the deployment and scaling of our backend video processors. One-click rollout of new releases, autoscaling and automatic restart on failure give us a very resilient platform. If there is an issue with any server, the load is moved to another server automatically. To prove our platform is resilient, we use a Chaos Monkey approach.
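To make the autoscaling idea concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas x current metric / target metric). Which autoscaler and which metric are in use here is not stated, so treat the function name, the parameters and the default limits below as assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Return the replica count an HPA-style rule would ask for.

    The limits and tolerance are illustrative defaults, not real settings.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    wanted = math.ceil(current_replicas * ratio)

    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four video processors running at 90% CPU against a 60% target would be scaled out, while the same four at 30% would be scaled in.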
With any platform, your data is key, so we utilise a distributed, multi data centre database (Couchbase) and a clustered message broker (RabbitMQ) for intersystem communication. Couchbase provides scalability and resiliency thanks to its distributed architecture. It has cross data centre replication, so you always have a current copy of the data in your offline data centre. Couchbase will detect any issue with a server and remove it from the cluster with minimum disruption to the service. It is also very simple to deploy and manage, giving us more time to focus on other things.
All systems are monitored by Datadog for rapid visibility of any potential issues. Our engineers can be notified of any issues via Slack or email before they become service-impacting.
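Being notified before an issue becomes service-impacting usually means alerting on a warning threshold that sits below the critical one. The sketch below shows that idea in plain Python; it does not use the Datadog API, and the Monitor class, the threshold values and the routing of warnings to Slack and criticals to Slack plus email are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Monitor:
    """A metric with a warning and a critical threshold (hypothetical)."""
    name: str
    warning: float
    critical: float

    def evaluate(self, value: float) -> str:
        # Higher values are worse, e.g. queue depth or error rate.
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


def channels_for(status: str) -> List[str]:
    # Assumed routing: warnings go to Slack, criticals also go to email.
    routing = {
        "ok": [],
        "warning": ["slack"],
        "critical": ["slack", "email"],
    }
    return routing[status]
```

A queue-depth monitor with a warning level of 100 and a critical level of 500 would notify Slack at a depth of 150, well before the critical level is reached.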
Testing and Deployment
It’s no good having solid infrastructure if the code isn’t up to standard or if deploying new releases causes issues. All code has to go through our rigorous test pipeline, and once the code has passed, our automatic deployment tools will roll it out to production. More importantly, if there are any issues, the same tools can restore the old release to minimise customer impact.
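The test, deploy and restore flow described above can be sketched as follows. This is a simplified model rather than the actual deployment tooling: the Deployer class, the tests_passed flag and the health_check callback are hypothetical names introduced for illustration.

```python
from typing import Callable, List


class Deployer:
    """Toy release manager: gate on tests, roll back on a failed check."""

    def __init__(self, initial_release: str,
                 health_check: Callable[[str], bool]) -> None:
        self.history: List[str] = [initial_release]
        self.health_check = health_check  # assumed post-deploy check

    @property
    def live(self) -> str:
        # The most recent entry is what production is running.
        return self.history[-1]

    def deploy(self, release: str, tests_passed: bool) -> bool:
        # Code that failed the test pipeline never reaches production.
        if not tests_passed:
            return False

        previous = self.live
        self.history.append(release)

        if self.health_check(release):
            return True

        # The new release is unhealthy: restore the old one.
        self.history.append(previous)
        return False
```

Recording the rollback as a new history entry, rather than deleting the failed one, keeps a full audit trail of what was live and when.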
Our Q3 review returned the following results:
Uptime: 100% (no outages)
Successful Studio requests: 99.994%, or 30 errors per month out of 800 000 requests
Uploads processed successfully: 99.3%
We are constantly looking to improve this, and having just finished the Kubernetes migration, we are now looking to improve our automatic reprocessing of failed uploads, as we know how valuable your videos are.