KevBurnsJr

Crunchyroll

I began as a Senior Secure Application Engineer in 2017 and was skip-promoted past Staff Engineer to Application Architect after 18 months.

Below are a few of my contributions.

Subscription Processing System

When I started at Crunchyroll, I joined a team of 6-8 engineers building a new subscription processing system for a subscription video on demand (SVOD) service with 2 million subscribers. Building a multi-tenant system to process over $10 million a month in credit card transactions presents a number of challenges.

The systems being replaced were single-tenant PHP apps with strange edge cases and known bugs. A new multi-tenant system, written in Go, was envisioned to replace them both.

A few things I did on the project:

  1. Designed the Database Schema

    The project had some architectural diagrams, but the data model was vague.
    I created an ERD in MySQL Workbench (15 tables) and socialized it with the team.
    After several compromises and iterations, we arrived at a workable solution (12 tables).
    I also convinced the team to adopt standard columns and foreign key constraints.

  2. Simplified the Architecture

    The original architecture called for 9 services and 4 queues.
    I was able to reduce that number to 4 services and 1 queue.

  3. Rebuilt the background workers

    The team had attempted to create a background worker system.
    It ran on everyone's VM except mine, because mine used 2 CPUs and the extra parallelism exposed its race conditions.
    I spent 2 days trying to fix its architecture, then gave up and built a simpler, thread-safe version in another 2 days.
    (A rough sketch of that approach appears after this list.)

  4. Explored edge cases

    The system supported multiple concurrent recurring subscriptions.
    It also prorated new subscriptions so that subscription processing could be batched:
    if you had 3 subscriptions, they would all renew on the same day of the month.
    Combined with payment failure scenarios and retry schedules, things could get complicated.
    I argued in favor of a centralized subscription state to prevent systemic inconsistency.
    (A simplified proration example also follows this list.)

  5. Worked with the Data team

    There was obviously a lot of important reporting data generated by the system.
    I worked with the data team and made schema changes to enable easier reporting.

  6. Created a migration plan

    The existing system was known to have accounts in weird states due to edge cases.
    For this reason, I campaigned for a per-user migration strategy:
    accounts were migrated individually from the old system to the new one, grouped into batches.
    The first batch was internal employees, so any problems would surface before paying users were affected.
    Accounts were then migrated in batches of 10,000.
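
To make item 3 above a bit more concrete, here is a rough sketch of the kind of thread-safe worker pool I'm describing. This is not the production code; the names and details are illustrative. The point is that Go channels do all of the coordination, so extra CPUs add parallelism rather than race conditions.

    package worker

    import (
        "context"
        "sync"
    )

    // Job is a unit of background work.
    type Job func(ctx context.Context) error

    // Pool fans jobs out to a fixed number of goroutines. All hand-off
    // happens through a single channel, so there is no shared mutable
    // state for workers to race on.
    type Pool struct {
        jobs chan Job
        wg   sync.WaitGroup
    }

    // NewPool starts `workers` goroutines that consume jobs until Close.
    func NewPool(ctx context.Context, workers int) *Pool {
        p := &Pool{jobs: make(chan Job)}
        for i := 0; i < workers; i++ {
            p.wg.Add(1)
            go func() {
                defer p.wg.Done()
                for job := range p.jobs {
                    _ = job(ctx) // a real worker would log and retry failures
                }
            }()
        }
        return p
    }

    // Submit enqueues a job, blocking if every worker is busy.
    func (p *Pool) Submit(j Job) { p.jobs <- j }

    // Close stops accepting work and waits for in-flight jobs to finish.
    func (p *Pool) Close() {
        close(p.jobs)
        p.wg.Wait()
    }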
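
And to illustrate the proration described in item 4, here is a simplified example of how a new subscription's first charge could be prorated so that it renews on the account's existing billing day. This is an assumption-laden sketch (the function name and the flat time-based math are mine), not the actual billing logic.

    package billing

    import "time"

    // ProrateFirstCharge returns the amount (in cents) to charge for a
    // new subscription started at `start`, so that its next renewal
    // lands on `nextBillingDate`, the account-wide billing day shared
    // by all of the account's subscriptions.
    func ProrateFirstCharge(monthlyPrice int64, start, nextBillingDate time.Time) int64 {
        cycleStart := nextBillingDate.AddDate(0, -1, 0)       // beginning of the current cycle
        cycleHours := nextBillingDate.Sub(cycleStart).Hours() // full cycle length
        remaining := nextBillingDate.Sub(start).Hours()       // portion the user actually gets
        if remaining <= 0 {
            return 0
        }
        if remaining > cycleHours {
            remaining = cycleHours
        }
        return int64(float64(monthlyPrice) * remaining / cycleHours)
    }

So a plan started halfway through the cycle is charged roughly half price up front, and from then on it renews alongside everything else on the account.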

Site Reliability Engineering

The weekend after I started, the website went down for several hours on a Saturday night when a new episode of a popular series went live on the site. As a member of the Secure Apps team, I was confident that the systems for which I was responsible were not the cause of service disruption. Nevertheless, I took it upon myself to investigate the stability issues to see if I could help other teams pinpoint the source of the problem.

After a few hours of sifting through NewRelic, CloudWatch, and access logs, I had a good understanding of the various service resiliency failures and decided that the application's stability issues could be fixed with a microcache.

So, that weekend, I went home and wrote an open source embedded HTTP cache in Go. To see why that helps, here is roughly how a page load worked:

  1. Client requests HTML page from NodeJS containing ReactJS app
  2. NodeJS calls to internal CMS API to prefetch data
  3. Client receives HTML page
  4. Any data not prefetched by NodeJS is fetched from CMS API directly

A strong 5-minute TTL on API responses in the CDN protected the external API from traffic bursts. However, these protections were not available for internal API calls from NodeJS, which were routed directly to the scaling group's load balancer.

The NodeJS app also had a 6-second timeout: it would attempt to fetch data from internal APIs for up to 6 seconds, at which point a response would be returned to the client regardless of how much data had been collected. This was problematic because in-flight requests were not cancelled on timeout. Once the 6-second limit was breached due to downstream latency during high traffic, outstanding requests would continue to be sent to downstream internal APIs even after a response had been returned to the client.

This aberrant behavior resulted in a request multiplication feedback loop, ensuring that if any of the system's internal APIs ever experienced increased latency for any reason, the entire application would beat itself to death with duplicate requests.

With no additional infrastructure, the HTTP caching middleware was integrated into the API, and it never failed in the same way again.
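
The cache itself is open source, but for the sake of illustration here is a minimal sketch of the core idea rather than the actual library's API: memoize successful responses in memory for a short TTL, so that repeated requests during a burst are served without touching the upstream service.

    package microcache

    import (
        "net/http"
        "net/http/httptest"
        "sync"
        "time"
    )

    type entry struct {
        code    int
        header  http.Header
        body    []byte
        expires time.Time
    }

    // Middleware serves repeated GET requests from memory for `ttl`,
    // absorbing traffic bursts with no additional infrastructure.
    // A real microcache would also collapse concurrent misses and
    // evict expired entries; this sketch omits both.
    func Middleware(ttl time.Duration, next http.Handler) http.Handler {
        var mu sync.Mutex
        cache := map[string]*entry{}

        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if r.Method != http.MethodGet {
                next.ServeHTTP(w, r)
                return
            }
            key := r.URL.String()

            mu.Lock()
            e, ok := cache[key]
            mu.Unlock()
            if ok && time.Now().Before(e.expires) {
                copyHeader(w.Header(), e.header)
                w.WriteHeader(e.code)
                w.Write(e.body)
                return
            }

            // Miss: record the upstream response so it can be replayed.
            rec := httptest.NewRecorder()
            next.ServeHTTP(rec, r)
            if rec.Code == http.StatusOK {
                mu.Lock()
                cache[key] = &entry{rec.Code, rec.Header().Clone(), rec.Body.Bytes(), time.Now().Add(ttl)}
                mu.Unlock()
            }
            copyHeader(w.Header(), rec.Header())
            w.WriteHeader(rec.Code)
            w.Write(rec.Body.Bytes())
        })
    }

    func copyHeader(dst, src http.Header) {
        for k, vs := range src {
            dst[k] = append([]string(nil), vs...)
        }
    }

Even a TTL of a few seconds is enough to flatten a burst of identical internal requests, which is why this pattern is called a microcache.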

A few weeks later, the front-end team introduced an error to production which effectively doubled the traffic to the API. This sudden traffic spike caused no service disruption thanks to the cache and actually improved the efficiency of the API.

I jumped in to fix the problem without being asked to do so because where I come from, engineers take notice when the website goes down.

Third Party Integration

The first project I led at Crunchyroll was a third party integration with AT&T.

Basically, AT&T would run a promotion offering some of its premium subscribers a free subscription to Crunchyroll's video on demand service.

As the project's tech lead, I:

  • Worked closely with product to gather product requirements
  • Generated engineering requirements with a multi-phase schedule
  • Managed communication and tracked deliverables with external parties
  • Created architectural diagrams and proposed new microservices
  • Created new microservices and infrastructure automation to support them
  • Participated in conference calls with outsourced external teams in India
  • Helped debug architectural security flaws in external systems
  • Created test plans and coordinated QA access across staging environments
  • Worked with Data team to provide adequate information for reporting
  • Worked with finance team to devise invoice pipeline based on reporting
  • Created SFTP endpoint with static IP for backend integration

QA was fun. AT&T actually mailed us batches of SIM cards linked to accounts in various states, which we had to insert into phones in order to perform end-to-end testing. This placed a physical limit on the number of end-to-end tests we could run and killed any hope of automated cross-organization integration testing. Internal automated unit and integration tests ensured that things on our end continued to operate as expected.

Integrating with large established enterprises requires plasticity.

It would have been nice if they had integrated with our API rather than delivering information via SFTP, but in the end it wasn't worth the fight. We were the smaller, more agile company and it made a lot more sense for us to build a one-off SFTP drop point to work with their existing systems than to ask them to integrate with us. Bending to the will of giants yielded a more stable system built in less time.

Microservice Standardization

Much has been written about microservice standardization in Go.

There are a million different ways to structure a repository. Some people think that in a large organization with dozens of microservices, it's important to have consistency across them. That's true to some extent, though as an application architect I felt there was more important work to be done. These are the standards I pushed for instead:

  1. Microservices MUST produce error logs.
  2. Microservices MUST be configured for log centralization (ELK).
  3. Microservices MUST NOT produce access logs. Those come from ELB -> S3.
  4. Microservices MUST use log levels.
  5. Microservices MUST set log level to WARN in prod.
  6. Microservices MUST set log level to INFO in stage.
  7. Microservices MUST use log levels appropriately.
  8. Microservices SHOULD integrate with NewRelic (excluding some very high traffic APIs).
  9. Microservices MUST use config files and not cli flags for runtime config.
  10. Microservices MUST set User-Agent header for all downstream internal API requests.
  11. Microservices MUST set an http timeout for all downstream internal API requests.

Note that these say nothing about the structure of a service's packages. You can dump it all in the project root if you feel like it. This is really basic stuff. It doesn't begin to touch sidecars or circuit breakers, but you've got to start somewhere.
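
As a concrete example of rules 10 and 11, the entire client-side requirement fits in a few lines of Go. The service name and timeout below are hypothetical; the point is that identification and a deadline come for free once every service uses a small wrapper like this.

    package internalapi

    import (
        "net/http"
        "time"
    )

    // Client wraps http.Client so every downstream internal request
    // identifies its caller (rule 10) and can never hang forever (rule 11).
    type Client struct {
        http    *http.Client
        service string
    }

    // New builds a client for a given service identity and timeout,
    // e.g. New("subscription-api/1.4.2", 3*time.Second).
    func New(service string, timeout time.Duration) *Client {
        return &Client{
            http:    &http.Client{Timeout: timeout},
            service: service,
        }
    }

    // Get performs a GET against an internal API with both rules applied.
    func (c *Client) Get(url string) (*http.Response, error) {
        req, err := http.NewRequest(http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        req.Header.Set("User-Agent", c.service)
        return c.http.Do(req)
    }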

Any misbehaving microservice should be treated as disposable. Rewriting a microservice should not require an act of Congress.

Event Collection System

Crunchyroll has multiple tenants feeding client-side and server-side events to Segment. With billions of events per month, the pricing was becoming unreasonable at the same time that the business wanted to emit more events from more clients. Not all of these events needed to go through Segment; for the highest-volume event types, Segment served only as an intermediary between clients and S3.

So, it was decided that events dispatched from the 20+ clients would be routed instead to an event collection service which would write all events to the data lake in S3 and proxy some events to Segment as necessary for product metrics.

As a co-lead of the project, I:

  • Measured expected cost savings
  • Created an architectural proposal
  • Estimated development, maintenance, and running costs (8 billion events/month)
  • Created the service
  • Added a user interface for QA
  • Automated infrastructure provisioning
  • Documented the new service
  • Worked with client teams to integrate the new service
  • Ran the service in dual-write mode in production
  • Worked with the data team to validate incoming data
  • Switched the new system to the system of record in production

The end result was an API capable of efficiently routing more than 3,000 events per second. The only bottleneck was the number of Kinesis shards, which had not yet been fitted with a resharding Lambda for autoscaling.

Efficiency

One important aspect of making this service efficient was the addition of batching: 80ms of artificial latency was added to the API in order to enable batch writes to Kinesis.

This graph shows the introduction of artificial latency.

Below you'll see CPU time per request drop significantly after batching is introduced.

In fact, the trend is reversed: the service becomes more CPU-efficient under higher load thanks to batching.

Scalability is not boolean. Most would say that a service that scales linearly is a scalable service. And they'd be right. But I'd argue that there's a higher goal we should all aspire to and that's sub-linear scalability (illustrated above).

Batching is an easy way to achieve economy of scale in web service APIs.
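
As a sketch of what that looks like in practice (simplified, and not the production code, which wrote to Kinesis via the AWS SDK): each request appends its event to a shared buffer and waits, and a single flusher writes the whole buffer downstream once per 80ms window. The artificial latency is simply the time spent waiting for that flush.

    package events

    import (
        "sync"
        "time"
    )

    // batch collects events between flushes. done is closed once the
    // whole batch has been written downstream, with err holding the result.
    type batch struct {
        events [][]byte
        err    error
        done   chan struct{}
    }

    // Batcher trades up to one flush window of added latency per request
    // for writing events downstream in bulk instead of one at a time.
    type Batcher struct {
        mu  sync.Mutex
        cur *batch
    }

    // NewBatcher flushes the accumulated batch every `window` (e.g. 80ms)
    // using the supplied flush function (e.g. a bulk Kinesis put).
    func NewBatcher(window time.Duration, flush func([][]byte) error) *Batcher {
        b := &Batcher{cur: &batch{done: make(chan struct{})}}
        go func() {
            ticker := time.NewTicker(window)
            defer ticker.Stop()
            for range ticker.C {
                b.mu.Lock()
                full := b.cur
                b.cur = &batch{done: make(chan struct{})}
                b.mu.Unlock()

                if len(full.events) > 0 {
                    full.err = flush(full.events)
                }
                close(full.done) // release every request waiting on this batch
            }
        }()
        return b
    }

    // Add queues one event and blocks until its batch has been flushed,
    // so an HTTP handler calling Add responds only after the bulk write.
    func (b *Batcher) Add(event []byte) error {
        b.mu.Lock()
        cur := b.cur
        cur.events = append(cur.events, event)
        b.mu.Unlock()
        <-cur.done
        return cur.err
    }

At 3,000 events per second, an 80ms window works out to roughly 240 events per downstream write and about a dozen writes per second instead of thousands, which is where the sub-linear CPU curve comes from.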

Conclusion

Architecture is the thing that happens before the building. I quit Crunchyroll because the company's biggest projects, which I was refused any involvement in planning, were headed toward certain disaster, and as an architect I wasn't willing to accept responsibility for the architecture of new systems I had no hand in designing. Rather than involving more technical leadership in the planning process, upper management decided to establish a new layer of non-technical project management between product and engineering, which I perceived as a strong move in the wrong direction.

When I quit, I indicated that a commitment from the CEO to find a new permanent CTO before the end of the year was the only thing that could convince me to stay. Six months after I quit, my boss was promoted to CTO (before the end of the year) ¯\_(ツ)_/¯. The department presumably has a new direction since, as far as I can tell, the features those poorly planned systems were supposed to enable were never released.

Crunchyroll is a great place with a lot of amazing people.
I enjoyed my time there and wish them all the best.

Crunchyroll will always be a dearly beloved brand for Anime fans across the globe.