Twitter

Staff Lead Software Engineer - Cortex Platform

April 21, 2021

Company Description
Twitter is what’s happening and what people are talking about right now. For us, life's not about a job, it's about purpose. We believe real change starts with conversation. Here, your voice matters. Come as you are and together we'll do what's right (not what's easy) to serve the public conversation.
Job Description
Who We Are
Cortex empowers internal teams to efficiently leverage ML by providing a platform and by unifying, educating, and advancing the state of the art in ML technologies within Twitter. We win when our customers win by helping our users stay informed, share and discuss what matters; by serving the public conversation. We’re building an AI-first company and every major initiative is increasingly dependent on the successful application of machine learning. Cortex is at the nexus of this evolution.
Our team of ML software engineers is constructing one of the strongest machine learning platforms in the world by marrying the latest ML industry practices with engineering excellence and the need to perform at Twitter scale. Our customers are all the ML engineers at Twitter. Our goal is to provide a unified tooling ecosystem that lets these engineers focus on what they are good at, building ML models with novel approaches, and that abstracts away the complexities of bringing these models into a production environment.
We care deeply about:

  • Engineering excellence such as good design abstractions, API stability, unit testing, leading best practices for other engineers to follow, and solid documentation.

  • Staying abreast of, and compatible with, a quickly shifting technology landscape for ML platform components and related open source solutions.

  • Creating the best ML Platform environment for Twitter that provides an exceptional developer experience for our engineering customers.

  • Encouraging engineering creativity and innovative solutions.

Our current projects include:

  • Establishing Kubeflow as a managed offering at Twitter

  • Enabling and sustaining GCP Infra/Platform components for broader use in the Cortex platform, e.g. AI Platform, Dataflow, Dataproc, etc.

  • Improving operations of essential ML Platform services

    • Hosted notebooks

    • Centralized ML Metastore

    • Centralized ML Dashboards

If this sounds like a team you want to be part of, great! We are looking for engineers who are passionate about writing code, have a desire to learn new technologies, love working in collaborative teams, and are committed to serving their customers.
Your responsibilities include:

  • Informing and accelerating best practices for GCP Infrastructure adoption (sustaining and improving User Onboarding, IAM, Image Management, Twitter Systems Integrations, Security, et al.)

  • Absorbing existing SRE/Operational support scopes (GPU Cluster Management, OS/Kernel Upgrades, RPM/Python Dependency Management, Bare Metal Host Management/Puppet Manifests, etc.)

  • Partnering with and supporting existing Cortex Platform teams with Operational guidance and expertise on various project initiatives

  • Creating tools and automation for Operational support and management for DS/ML use cases

  • Supporting various users and developers with operational issues (e.g. “I’m having trouble scheduling GPU jobs with Persistent Volumes”)

  • Capacity Planning

  • Maintaining version updates of TensorFlow, PyTorch, et al.

  • Partnering with Twitter’s Platform and Data Platform orgs to improve, enhance and influence direction and integration opportunities

  • Partnering with teams to improve, enhance and integrate with the company’s GCP Adoption & Management strategy

Qualifications

Who You Are


  • Experienced working with Kubernetes and Kubeflow. It is a huge plus if you are an OSS contributor to either project.

  • Experienced leading and mentoring technical teams through design and implementation across an organization.

  • 6+ years of experience handling services in a large-scale distributed systems environment, preferably services on GCP (e.g. BigQuery).

  • Expert knowledge of Linux operating system internals, filesystems, disk/storage technologies, storage protocols, and the networking stack.

  • Expert knowledge of systems programming (bash and shell tools) and practical, proven knowledge of at least one higher-level language (Python, Go or Scala).

  • Comfortable working with on-prem and cloud-based infrastructure (GCP) in terms of deployment, support, monitoring, administration and troubleshooting.

  • Track record of practical problem solving, and excellent communication and documentation skills.

  • Proven understanding of systems and application design, including the operational trade-offs of various designs.

  • Able to work well with, and influence, a myriad of personalities at all levels.

  • Adaptable and able to focus on the simplest, most efficient and reliable solutions.

  • Solid understanding of algorithms, distributed systems design, and the software development lifecycle.

Additional Information
All your information will be kept confidential according to EEO guidelines. We are committed to an inclusive and diverse Twitter. Twitter is an equal opportunity employer. We do not discriminate based on race, color, ethnicity, ancestry, national origin, religion, sex, gender, gender identity, gender expression, sexual orientation, age, disability, veteran status, genetic information, marital status or any legally protected status.
We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.