I recently co-chaired the first conference on Machine Learning Ops, USENIX OpML 2019. It was an energetic gathering of experts, practitioners, and researchers who came together for one day in Santa Clara, CA to talk about the problems, practices, new tools, and cutting-edge research on production machine learning in industries including finance, insurance, healthcare, security, web-scale services, and manufacturing.
While there were many great presentations, papers, panels and posters (too many to talk about individually - check out all the details here), there were several emergent trends and themes. I expect each of these will expand and become even more prominent over the next several years as more organizations push ML into production and use machine learning ops practices to scale ML in production.
Agile Methodologies Meet Machine Learning
Many practitioners emphasized that iteration and continuous improvement are essential to successful production ML. Much like software, machine learning improves through iteration and regular production releases. Those running ML at scale made a point of recommending that projects start with either no ML or simple ML to establish a baseline. As one practitioner put it, you don’t want to spend a year investing in a complex deep learning solution, only to find out after deployment that a simpler non-ML method can outperform it!
Bringing agility to ML also requires that the infrastructure be optimized to support agile rollouts. This means that successful production ML infrastructure includes automated deployment, modularity, use of microservices and avoiding excessive fine-grained optimization early on.
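The baseline-first advice above can be made concrete with a minimal sketch. This is a hypothetical illustration (the function names and toy data are mine, not from any OpML talk): a trivial majority-class predictor gives you a score that any ML model must beat before it earns its complexity.

```python
# Hypothetical sketch: before investing in a complex model, measure a
# trivial non-ML baseline so any future model has a bar to clear.
from collections import Counter

def majority_baseline(train_labels):
    """Always predict the most common label seen in training."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common

def accuracy(predict, examples):
    """Fraction of (features, label) pairs the predictor gets right."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)

# Toy, imbalanced data: the baseline already scores well here, which is
# exactly why it is worth measuring first.
train = [([0.1], "ok"), ([0.2], "ok"), ([0.9], "fail"), ([0.3], "ok")]
test  = [([0.15], "ok"), ([0.8], "fail"), ([0.25], "ok"), ([0.35], "ok")]

baseline = majority_baseline([y for _, y in train])
print(accuracy(baseline, test))  # 0.75 -- the bar any ML model must beat
```

If a deep learning model can only match this number after deployment, the year of investment was not worth it; that is the point the practitioners were making.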
ML Bugs Differ from Software Bugs and Need ML-Specific Production Diagnostics
Several presentations provided memorable examples of how ML errors not only bypass conventional production checks but can actually look like improved production performance. For example, an ML model that fails and emits a default output can register as a performance boost!
Detecting ML bugs in production requires specialized techniques such as Model Performance Predictors, comparisons with non-ML baselines, visual debugging tools, and metric-driven design of the operational ML infrastructure. Facebook, Uber, and other organizations experienced with large-scale production machine learning emphasized the importance of ML-specific production metrics, ranging from health checks to ML-specific resource-utilization metrics (such as GPU usage).
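One simple ML-specific check of the kind described above can be sketched in a few lines. This is an assumption-laden illustration (the function name, thresholds, and data are hypothetical, not from any vendor's tooling): a model that silently crashes into a constant fallback output passes ordinary "did the service respond?" checks, so we also watch the diversity of the prediction stream.

```python
# Hypothetical sketch of an ML-specific health check: a model stuck on a
# constant default output looks alive to ordinary uptime monitoring, so
# we inspect the shape of the prediction stream itself.
from collections import Counter

def prediction_health(predictions, min_distinct=2, max_top_share=0.95):
    """Flag degenerate output streams that ordinary uptime checks miss."""
    counts = Counter(predictions)
    top_share = counts.most_common(1)[0][1] / len(predictions)
    return {
        "distinct": len(counts),
        "top_share": top_share,
        "healthy": len(counts) >= min_distinct and top_share <= max_top_share,
    }

healthy_stream = ["cat", "dog", "cat", "bird", "dog", "cat"]
stuck_stream = ["default"] * 100  # model failed into its fallback output

print(prediction_health(healthy_stream)["healthy"])  # True
print(prediction_health(stuck_stream)["healthy"])    # False
```

Real systems would track such signals over time windows and alert on drift against a non-ML baseline; the thresholds here are illustrative placeholders.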
Rich Open Source Ecosystem for All Aspects of Machine Learning Ops
The rich open source ecosystem for model development (with TensorFlow, scikit-learn, Spark, PyTorch, R, etc.) is well known. OpML showcased how the open source ecosystem for Machine Learning Ops is growing rapidly, with powerful publicly available tooling used by large and small companies alike. Examples include Apache Atlas for governance and compliance, Kubeflow for machine learning ops on Kubernetes, MLflow for lifecycle management, and TensorFlow tracing for monitoring. Classic enterprise vendors are starting to integrate these open source packages to provide full solutions for their customers; an example is Cisco’s support of Kubeflow. Furthermore, web-scale companies are open sourcing the core infrastructure that drives their production ML, such as the ML orchestration tool TonY from LinkedIn.
As these tools become more prominent, practitioners are also documenting end-to-end use cases, creating design patterns that can be used as best practices by others.
Cloud-based Services and SaaS Make Production ML Easier
For a team trying to deploy ML in production for the first few times, the process can be daunting, even with open source tools available for each stage of the process. The cloud offers an alternative. Since the resource management aspects (such as machine provisioning, auto-scaling, elasticity, etc.) are handled by the cloud backend, cloud deployments can be simpler. When accelerators (GPUs, TPUs, etc.) are used, production resource management can be challenging and using cloud services is a way to get started by leveraging the investments made by cloud providers to optimize accelerator usage.
Cloud deployment can also create a ramp-up path for an IT organization to try ML deployment without a large in-house infrastructure rollout. Even on-premises enterprise deployments are moving to a self-service production ML model similar to that of a cloud service, enabling an IT organization to serve the production ML needs of multiple teams and business units.
Leveraging Expertise: From Web-Scale ML Operations to the Enterprise
At-scale experts like LinkedIn, Facebook, Google, Airbnb, Uber, and others, who were the first ML adopters, had to build from scratch all the infrastructure and practices needed to extract monetary value from ML. These experts are now sharing not only their code but also their hard-won operational lessons, all of which can be adapted for the benefit of enterprises. As the Experts Panel at OpML pointed out, the best practices these organizations follow for ML infrastructure (from team composition and reliability engineering to resource management) contain powerful insights that enterprises can draw on as they seek to expand their production ML footprint. Experiences from at-scale ML deployments at Microsoft and others can show enterprises how to deliver machine learning into their business applications.
Other end-to-end experiences from at-scale companies showed how business metrics can be translated into ML solutions, and how the resulting ML solution can be iteratively improved for business benefit. Finally, organizations facing the unique challenges that edge deployment places on Machine Learning Ops can benefit from learning from the at-scale deployments already in place.
A great op-ed by Michael Jordan on Medium, “Artificial Intelligence: The Revolution Hasn’t Happened Yet”, highlighted the need for an AI engineering practice. OpML 2019, the first Machine Learning Ops conference, illustrated how the ML and AI industry is maturing in this direction, with more and more organizations either struggling with the operational and lifecycle management aspects of production machine learning or pushing to scale ML operations and develop operational best practices. This is great news for the AI industry, since it is a further step toward generating real ROI from AI investments. Trends like those above should help realize the long-awaited potential of AI-generated business value.
Nisha Talagala, contributor, is Co-Founder, CTO and VP, Engineering at ParallelM. Nisha has more than 15 years of expertise in software development, distributed systems, I/O solutions, persistent memory, and flash. Prior to ParallelM, Nisha was a Fellow at SanDisk and Fellow/Lead Architect at Fusion-io, where she drove innovation in non-volatile memory, in particular the industry’s first persistent memory solution. She was technology lead for server flash at Intel, where she led server platform non-volatile memory technology development, storage-memory convergence, and partnerships. Before joining Intel, Nisha was the CTO of Gear6, where she designed and built clustered computing caches for high performance I/O environments. Nisha earned her PhD at UC Berkeley with research on software clustering and distributed storage. Nisha holds 59 patents in distributed systems, networking, storage, performance and non-volatile memory and serves on multiple industry and academic conference program committees.