I began as a Senior Secure Application Engineer in 2017 and was skip promoted past staff engineer to Application Architect after 18 months.
Below are a few of my contributions.
When I started at Crunchyroll, I joined a team of 6-8 engineers building a new subscription processing system for a subscription video on demand (SVOD) service with 2 million subscribers. Building a multi-tenant system to process over $10 million a month in credit card transactions presents a number of challenges.
The two systems being replaced were single-tenant PHP apps with strange edge cases and known bugs. A new multi-tenant system, written in Go, was envisioned to replace both.
A few things I did on the project:
The weekend after I started, the website went down for several hours on a Saturday night when a new episode of a popular series went live on the site. As a member of the Secure Apps team, I was confident that the systems for which I was responsible were not the cause of service disruption. Nevertheless, I took it upon myself to investigate the stability issues to see if I could help other teams pinpoint the source of the problem.
After a few hours sifting through New Relic, CloudWatch and access logs, I had a good understanding of the various service resiliency failures and decided that the application's stability issues could be fixed with a microcache.
So, that weekend, I went home and I wrote an open source embedded HTTP cache in Go.
A strong 5 minute TTL on API responses in the CDN protected the external API from traffic bursts. However, these protections were not available on internal API calls from NodeJS which were routed directly to the scaling group's load balancer.
The NodeJS app also had a 6-second timeout: it would attempt to fetch data from internal APIs for up to 6 seconds, at which point a response would be returned to the client regardless of how much data had been collected. This was problematic because nothing shut down the outstanding work when the timeout fired. Once downstream latency during high traffic pushed a request past the 6-second timeout, its outstanding requests would continue to be sent to downstream internal APIs even after a response had been returned to the client.
This aberrant behavior resulted in a request multiplication feedback loop, ensuring that if any of the system's internal APIs ever experienced increased latency for any reason, the entire application would beat itself to death with duplicate requests.
With no additional infrastructure, the HTTP caching middleware was integrated into the API, and it never failed in the same way again.
A few weeks later, the front-end team introduced an error to production which effectively doubled the traffic to the API. This sudden traffic spike caused no service disruption thanks to the cache and actually improved the efficiency of the API.
I jumped in to fix the problem without being asked to do so because where I come from, engineers take notice when the website goes down.
The first project I led at Crunchyroll was a third-party integration with AT&T.
Basically, AT&T would run a promotion offering some of its premium subscribers a free subscription to Crunchyroll's video on demand service.
As the project's tech lead, I:
QA was fun. AT&T actually mailed us batches of SIM cards linked to accounts in various states, which we had to insert into phones in order to perform end-to-end testing. This placed a physical limit on the number of end-to-end tests we could run and completely decimated any hopes of automated cross-organization integration testing. Internal automated unit and integration tests ensured that things on our end continued to operate as expected.
Integrating with large established enterprises requires plasticity.
It would have been nice if they had integrated with our API rather than delivering information via SFTP, but in the end it wasn't worth the fight. We were the smaller, more agile company and it made a lot more sense for us to build a one-off SFTP drop point to work with their existing systems than to ask them to integrate with us. Bending to the will of giants yielded a more stable system built in less time.
Much has been written about microservice standardization in Go.
There are a million different ways to structure a repository. Some people think that in a large organization with dozens of microservices, it's important to have consistency across microservices. This is true to some extent, though I felt as an application architect that there was more important work to be done.
Note that these say nothing about the structure of a service's packages. You can dump it all in the project root if you feel like it. This is really basic stuff. It doesn't begin to touch sidecars or circuit breakers, but you've got to start somewhere.
Any misbehaving microservice should be treated as disposable. Rewriting a microservice should not require an act of Congress.
Crunchyroll has multiple tenants feeding client and server side events to Segment. With billions of events per month, the pricing was becoming unreasonable at the same time that the business wanted to emit more events from more clients. Not all of these events needed to go through Segment. Segment only served as an intermediary between clients and S3 for the highest volume event types.
So, it was decided that events dispatched from the 20+ clients would be routed instead to an event collection service which would write all events to the data lake in S3 and proxy some events to Segment as necessary for product metrics.
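The routing rule described here (everything goes to the data lake, only selected event types go on to Segment) can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions: the `Event`, `Sink` and `Router` types, the in-memory `memSink`, and the `signup` and `heartbeat` event types are all hypothetical, and the real service wrote to S3 and Segment, not to memory.

```go
package main

import "fmt"

// Event is a hypothetical minimal event shape.
type Event struct {
	Type string
	Body string
}

// Sink is anything that can receive events (hypothetical interface).
type Sink interface {
	Write(Event) error
}

// memSink records events in memory so the sketch runs without S3 or Segment.
type memSink struct{ events []Event }

func (m *memSink) Write(e Event) error {
	m.events = append(m.events, e)
	return nil
}

// Router writes every event to the lake and forwards only allow-listed types.
type Router struct {
	Lake    Sink
	Segment Sink
	Forward map[string]bool // event types that product metrics still need
}

// Route always stores the event, then proxies it only if its type is listed.
func (r Router) Route(e Event) error {
	if err := r.Lake.Write(e); err != nil {
		return err
	}
	if r.Forward[e.Type] {
		return r.Segment.Write(e)
	}
	return nil
}

func main() {
	lake, seg := &memSink{}, &memSink{}
	r := Router{Lake: lake, Segment: seg, Forward: map[string]bool{"signup": true}}
	events := []Event{
		{Type: "signup", Body: "a"},
		{Type: "heartbeat", Body: "b"},
		{Type: "heartbeat", Body: "c"},
	}
	for _, e := range events {
		if err := r.Route(e); err != nil {
			fmt.Println("route error:", err)
		}
	}
	fmt.Println(len(lake.events), len(seg.events)) // prints "3 1"
}
```

Keeping the forwarding decision in a plain allow-list means high-volume event types can be taken off the paid path without any client changes.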
As a co-lead of the project, I:
The end result was an API capable of efficiently routing more than 3,000 events per second. The only bottleneck was the number of Kinesis shards, which had not yet been fitted with a resharding Lambda for autoscaling.
One important aspect of making this service efficient was the addition of batching. 80ms of artificial latency was added to the API in order to enable batch writes to Kinesis.
This graph shows the introduction of artificial latency.
Below you'll see CPU time per request drop significantly after batching is introduced.
In fact, the trend is reversed. The service now becomes more CPU efficient under higher load due to batching.
Scalability is not boolean. Most would say that a service that scales linearly is a scalable service. And they'd be right. But I'd argue that there's a higher goal we should all aspire to and that's sub-linear scalability (illustrated above).
Batching is an easy way to achieve economies of scale in web service APIs.
Architecture is the thing that happens before the building. I quit Crunchyroll because the company's biggest projects, for which I was refused any involvement in planning, were headed toward certain disaster, and as an architect I wasn't willing to accept responsibility for the architecture of new systems without having any involvement in designing them. Rather than involving more technical leadership in the planning process, upper management decided to establish a new layer of non-technical project management between product and engineering, which I perceived as a strong movement in the wrong direction. I had hoped that Crunchyroll's board of directors would find a permanent CTO to fix these organizational problems, but I now suspect it's more likely that the entire engineering department will be absorbed by their new parent company (WarnerMedia), due in no small part to the problems caused by a severe lack of architectural involvement during planning.
Crunchyroll is a great place with a lot of amazing people.
I enjoyed my time there and wish them all the best.
Crunchyroll will always be a dearly beloved brand for anime fans across the globe.