Engineer - Guest Reliability (Site Reliability)
JOIN US AS A GUEST RELIABILITY ENGINEER
We're building a BRAND NEW reliability team here at Target HQ, and you can be part of standing it up! The Guest Reliability Engineer is responsible for driving the reliability of our applications and infrastructure so that we avoid service disruptions or, when we cannot avoid them, resolve them quickly. As a GRE at Target, you will do this through a combination of IT operational work and automating what you learn from that work. Put simply, you will substitute software for human labor in recoveries of our systems. In addition, you'll work on ensuring the following for our systems: availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning.
Job Responsibilities
- Design, write and deliver software/automation to improve the recoverability, availability, scalability, latency, and efficiency of products.
- Monitor and recover multiple applications within the following product groups: Stores, Supply Chain, Corporate Applications, and Infrastructure.
- Provide preventative activities, proactive monitoring, troubleshooting and quick resolution of events and incidents to ensure infrastructure and application stability.
- Prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Design, write and deliver monitors and dashboards that improve predictability and are actionable in a proactive manner.
- Handle day-to-day operational management, including response, incident, event and problem management activities.
- Understand different platforms, applications, hardware and infrastructure and how they interact.
- Enhance the knowledge repository that helps reduce recovery times for service disruptions.
- Consult with architects and other senior engineers across projects or services to complete architectural and technical design deliverables.
- Provide technical oversight to others resolving high-severity hardware, operational, infrastructure and application incidents.
- Oversee preventative maintenance and troubleshooting, and quickly resolve problems to ensure infrastructure and application stability.
- Provide thought leadership within the team and to the broader community to promote re-use and develop consistent technical build, implementation and support processes.
- Lead the design, lifecycle management, and total cost of ownership of platforms, applications and infrastructure services.
- Provide input to the strategic technical roadmap for platform or infrastructure services.
Job Requirements
- BA/BS or equivalent experience
- 1-3 years total work experience
- Has in-depth knowledge of state-of-the-art engineering approaches to design, build, testing, and debugging, as required by domain
- Maintains technical knowledge within areas of expertise
- Stays current with new and evolving technologies via formal training and self-directed education
- Proficient in the following technology areas:
  - Java/C#/C++ programming languages
  - MySQL or SQL Server
  - PowerShell, Ruby, or Python
- Knowledge of scripting languages and the skills to build scripts and automation, such as VBScript, Windows PowerShell, Perl, Windows Management Instrumentation, Windows Remote Management, and the Microsoft System Center suite of tools.
- Technical aptitude and skills around Microsoft Windows, with a desire to build domain application knowledge and ServiceNow skills.
- Technical knowledge of operations hardware and applications.
- Excellent communication skills and ability to manage vendor partners.
- Strong reasoning, troubleshooting, problem solving and analytical skills.
- A desire to avoid repetitive activities and instead use coding skills to reduce human labor.