Principal Machine Learning Ops Engineer - High Performance Computing (HPC MLOps) Remote OR HQ

Brooklyn Park, Minnesota
Dec 18, 2021
Feb 04, 2022
Employment Status
Full Time

About us:

Target is an iconic brand, a Fortune 30 company and one of America's leading retailers. Target as a tech company? Absolutely. We're the behind-the-scenes powerhouse that fuels Target's passion and commitment to cutting-edge innovation. We anchor every facet of one of the world's best-loved retailers with a strong technology framework that relies on the latest tools and technologies, and the brightest people, to deliver incredible value to guests online and in stores. Target Technology Services is on a mission to offer the systems, tools and support that guests and team members need and deserve. Our high-performing teams balance independence with collaboration, and we pride ourselves on being versatile, agile and creative. We drive industry-leading technologies in support of every angle of the business, and help ensure that Target operates smoothly, securely and reliably from the inside out.

As a Principal Engineer, you serve as the technical anchor for the engineering team that supports a product. You create, own and are responsible for the application architecture that best serves the product's functional and non-functional needs. You identify and drive architectural changes to accelerate feature development or improve the quality of service (or both). You have deep and broad engineering skills and are capable of standing up an entire architecture on your own, but you choose to influence a wider team by acting as a "force multiplier". Core responsibilities of this job are described within this job description. Job duties may change at any time due to business needs.

Use your skills, experience and talents to be a part of groundbreaking thinking and visionary goals. As a Principal Engineer, you'll take the lead as you use your technology acumen to apply and maintain knowledge of current and emerging technologies within specialized area(s) of the technology domain. Evaluate new technologies and participate in decision-making, accounting for several factors such as viability within Target's technical environment, maintainability, and cost of ownership. Initiate and execute research and proof-of-concept activities for new technologies. Lead or set strategy for testing and debugging at the platform or enterprise level. In complex and unstructured situations, serve as an expert resource to create and improve standards and best practices to ensure high-performance, scalable, repeatable, and secure deliverables. Lead the design, lifecycle management, and total cost of ownership of services. Provide the team with thought leadership to promote re-use and develop consistent, scalable patterns. Participate in planning services that have enterprise impact. Provide suggestions for handling routine and moderately complex technical problems, escalating issues when appropriate. Gather information, data, and input from a wide variety of sources; identify additional resources when appropriate, engage with appropriate stakeholders, and conduct in-depth analysis of information. Develop plans and schedules, estimate resource requirements, and define milestones and deliverables. Monitor workflow and risks; play a leadership role in mitigating risks and removing obstacles. Lead and participate in complex construction, automation, and implementation activities, ensuring successful implementation with architectural and operational requirements met.
Establish new standards and best practices to monitor, test, automate, and maintain IT components or systems. Serve as an expert resource in disaster recovery and disaster recovery planning. Stay current with Target's technical capabilities, infrastructure, and technical environment. Develop fully attributed data models, including logical, physical, and canonical. Influence data standards, policies, and procedures. Install, configure, and/or tune data management solutions with minimal guidance. Monitor data management solution(s) and identify optimization opportunities.

The High-Performance Computing (HPC) team at Target has built products ranging from data operations to neural computing. We provide the infrastructure and tools for ML engineers, AI scientists, AI engineers, data scientists, data engineers, and other team members to analyze and take action on their data. We also research the capabilities of next-generation computer hardware, architectures, and algorithms in order to build enterprise-grade, efficient products and provide guidance on building scalable, fully utilized infrastructures.

As a Machine Learning Engineer on the HPC team at Target, you'll study recent developments and implement novel data engineering and machine learning components for high-performance machine learning development, operations, and deployment, including DevOps, DataOps, and MLOps practices. You'll also work closely with other AI engineers, AI scientists, and data engineers, guiding them in using our high-performance ML infrastructure and platform to its maximum potential for high-efficiency machine learning. You'll receive hands-on experience and exposure to designing and building low-latency, high-performance, power-efficient, hardware-aware machine learning pipelines. This is an excellent opportunity for engineers with a passion for deploying and managing novel enterprise infrastructure and tools on which core ML systems are run.

Location: Sunnyvale, CA OR Minneapolis, MN. Remote/FTE work is considered on a case-by-case basis.

  • Work directly with the HPC team, Target AI (the advanced algorithm team), and data engineers to develop and maintain DevOps tools to deploy ML systems on heterogeneous Kubernetes clusters of GPUs and tensor processors.
  • Design, deploy and configure management systems for various ML software.
  • Optimize existing components for scalability, stability, computational load, speed, latency, etc. depending on the underlying hardware platform.
  • Build observability metrics to monitor deployed ML applications.
  • Provide technical guidance and mentorship to team members on development and operations.
  • Document and design various deployment processes; update existing processes.

About you:
  • MS/PhD degree in Electrical Engineering, Computer Science or relevant work experience
  • Extensive experience with feature engineering and Machine Learning pipelines
  • Experience in building highly scalable distributed systems
  • Experience in the cloud and edge deployment of Machine Learning solutions

Additional Requirements:
  • Highly proficient in writing automation scripts in Python
  • Familiarity with ML frameworks, such as PyTorch and TensorFlow
  • Demonstrated knowledge of the Linux operating system
  • Proficient in the following technologies:
    • Kubernetes
    • Airflow
    • Kubeflow
    • Docker
    • Helm
    • Ansible
    • Consul

Americans with Disabilities Act (ADA)

Target will provide reasonable accommodations (such as a qualified sign language interpreter or other personal assistance) with the application process upon your request as required to comply with applicable laws. If you have a disability and require assistance in this application process, please visit your nearest Target store or Distribution Center or reach out to Guest Services at 1-800-440-0680 for additional information.

