Software Engineer - Distributed Training
- Clockwork.io
- Palo Alto, California
- 10mo ago
- Full-time
- On-site
Clockwork.io – A Software-Driven Revolution in AI Networking
Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.
To learn more, visit www.clockwork.io.
We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.
You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.
Preferred Skills
Bonus Skills
Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.