eBook | Jan 2023
All rights reserved to Run:ai. No part of this content may be used without express permission of Run:ai.
www.run.ai
The 2023 Sae of
AI Infrasrucure Surve
The 2023 State of AI Infrastructure Survey
eBook | Jan 2023
Table of Contents
Introduction and Key Findings
Survey Repo Findings
Challenges for AI Development
AI/ML Stack Architecture
Tools Used to Optimize GPU Allocation Between Users
Plan to Grow GPU Capacity in the Next 12 Months
Aspects of AI Infrastructure Planned for Implementation (within 6-12 months)
Measurement of AI/ML Infrastructure Success
Tools Used to Monitor GPU Cluster Utilization
On-Demand Access to GPU Compute
Plans to Move AI Applications and Infrastructure to the Cloud
Percentage of AI/ML Models Deployed in Production
Main Impediments to Model Deployment
Demographics
About Run:ai
The aicial intelligence (AI) industry has grown
rapidly in recent years, as has the need for more
advanced and scalable infrastructure to suppo
its development and deployment. The global AI
Infrastructure Market was valued at $23.50 billion in
2021 and is expected to reach $422.55 billion by 2029,
at a forecasted CAGR of 43.50% between 2022-2029.
One of the main drivers of progress in the AI
infrastructure market has been increasing awareness
among enterprises of how AI can enhance their
operational eciency, aract new business and grow
new revenue streams, while reducing costs through
the automation of process ows. Other drivers include
the adoption of sma manufacturing processes using
AI, blockchain and IoT technologies, the increased
investment by GPU/CPU manufacturers in the
development of compute-intensive chips, and the
rising popularity of chatbots, like OpenAI’s recently
launched ChatGPT, for example.
The current hype around AI has given rise to a
renewed focus on geing it into the enterprise, and
organizations are increasingly eager to sta using and
developing AI applications themselves. But with an
abundance of new AI infrastructure tools f
looding
Introduction
and Ke Findings
Introduction
the rapidly evolving industry, it is akin to a so of
technological “Wild West”, with no real best practices
for enterprises to follow as they get AI into production.
As they begin to invest more heavily in AI, theres a lot
riding on how they decide to build their infrastructure
and service their practitioners.
This is the second ‘State of AI Infrastructure’ survey
we are running, due to all the new activity in the
industry and new AI companies in the AI space,
we’re keen to see what’s changed. We’re paicularly
interested in new insights into how organizations
are approaching the build of their AI infrastructure,
why they are building it, how they are building it,
what are the main challenges they face, and how the
abundance of dierent tools has aected geing AI
into production. We hope that the insights from this
survey will be helpful to those who both build and use
AI infrastructure.
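
The market figures quoted in the introduction can be cross-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: the function names are hypothetical, and it assumes eight compounding periods (2021 to 2029), which the report does not state explicitly.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate implied by a start value, an end value
    and a number of yearly compounding periods."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value: float, rate: float, periods: int) -> float:
    """Value reached after compounding `rate` for `periods` years."""
    return start_value * (1 + rate) ** periods


# Figures quoted in the introduction, in USD billions.
MARKET_2021 = 23.50
MARKET_2029 = 422.55
PERIODS = 2029 - 2021  # assumption: eight yearly compounding periods

implied_rate = cagr(MARKET_2021, MARKET_2029, PERIODS)
projected_2029 = project(MARKET_2021, 0.4350, PERIODS)

print(f"Implied CAGR: {implied_rate:.2%}")
print(f"2029 value at 43.50% CAGR: ${projected_2029:.2f} billion")
```

Under that assumption, the quoted start value, end value and growth rate are consistent with one another.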
Methodology

To get more insight into the current state of AI
infrastructure, we commissioned a survey of 450
Data, Engineering, AI and ML (Machine Learning)
professionals from a variety of industries. The survey
was administered online by Global Surveyz Research,
an independent global research firm. It is based on
responses from a mix of Data Scientists, Researchers,
Heads of AI, Heads of Deep Learning, IT Directors,
VPs IT, Systems Architects, ML Platform Engineers
and MLOps, from companies across the US and
Western EU ranging in size from under 200 to over
10,000 employees. The respondents were recruited
through a global B2B research panel and invited via
email to complete the survey, with all responses
collected during the second half of 2022.

The average amount of time spent on the survey
was five minutes and fifty seconds. The answers to
the majority of the non-numerical questions were
randomized in order to prevent order bias in the
answers.
Key Findings

1. Data has been overtaken by Infrastructure and Compute as the
main challenges for AI development

A whopping 88% of survey respondents admitted
to having AI development challenges (Figure 1),
which is telling in itself. It’s also interesting
to note that Data, which was ranked by 61% of
respondents in last year’s survey as their main
challenge in AI development, was overtaken this
year by infrastructure (i.e., the different platforms
and tools that comprise “the stack”) and compute
(i.e., getting access to GPU resources, not having to
wait for resources, etc.), chosen by 54% and 43% of
respondents respectively as their main challenges.
This year, Data ranked as the third biggest challenge
in AI development (41%). The fact that infrastructure
and compute-related challenges are now the top
concern for companies reinforces the importance of
building the right foundation, for the right stack, to get
the most out of their compute.

2. The more GPUs, the bigger the reliance on multiple
third-party tools

As organizations scale and require more GPUs, it
becomes more complex to build the right AI
infrastructure to get the right amount of compute
to all of the different workloads, tools, and end users.
80% of companies are now using third-party tools,
and the more GPUs they require, the bigger their
reliance on multiple third-party platforms, increasing
from 29% in companies with fewer than 50 GPUs
to 50% in companies with more than 100 GPUs
(Figure 2). A more open, middleware approach would
make more sense: organizations could use different
tools that run on the same infrastructure, so that they
are not locked into one end-to-end platform.
3. On-demand access to GPU compute is still very low, with 89%
of companies facing resource allocation issues regularly

Only 28% of the respondents have on-demand access
to GPU compute (Figure 8). When asked how GPUs
are assigned when not available on demand,
51% indicated they are using a ticketing system
(Figure 9), suggesting that on-demand access is still
lacking. So it’s no wonder that 89% of respondents
face allocation issues regularly (Figure 10), even
though some of them (58%) claim to have somewhat
automatic access, with 40% facing those GPU/compute
resource allocation issues weekly.

4. In 88% of companies, more than half of AI/ML models never
make it to production

While most companies are planning to grow their
GPU capacity or other AI infrastructure in the coming
year, for 88% of them (compared with 77% in last
year’s survey), more than half their AI/ML models
don’t make it to production. On average, only
37% of AI/ML models are deployed in production
environments (Figure 13). The main impediments to
actually deploying (Figure 14) include scalability (47%),
performance (46%), technical (45%), and resources
(42%). The fact that they were all mentioned as a
“main impediment” by such a substantial portion of the
respondents shows that the obstacle to model
deployment is multi-faceted rather than a single
glaring impediment.

5. 91% of companies are planning to grow their GPU capacity or
other AI infrastructure in the next 12 months

The vast majority (91%) of companies are planning
to grow their GPU capacity or other AI infrastructure
by an average of 23% in the next 12 months
(Figure 4), despite the uncertainty of the current
economic climate. Organizations won’t invest in AI
unless they can actually get value out of it, so this
result is a resounding testament to the fact that most
companies see huge potential and value in continued
investment in AI.
The 2023 State of AI Infrastructure Survey
eBook | Jan 2023
Survey Report Findings

Challenges for AI Development

When asked what their company’s main challenges
are around AI development, 88% of respondents
admitted to having AI development challenges.
The top challenges are infrastructure-related
challenges (54%), compute-related challenges (43%),
and data-related challenges (41%).

It’s interesting to note that infrastructure and
compute have overtaken Data as the biggest
challenges. This reinforces the importance of building
the right foundation, for the right stack, to get the
most out of your compute.

Figure 1: Challenges for AI development
Infrastructure-related challenges: 54%
Compute-related challenges: 43%
Data-related challenges: 41%
Expense of doing AI: 34%
Training-related challenges: 34%
Defining business goals: 18%
We have no AI development challenges: 12%
AI/ML Stack Architecture

When asked how their AI/ML infrastructure is
architected, 11% of respondents said it is all built
in-house, while 47% have a mix of in-house and
third-party platforms. We also saw that the use of
multiple third-party platforms grows with the number
of GPUs (29% for those with < 50 GPUs, and up to
50% for those with 100+ GPUs). This confirms that the
practice of taking AI into production and streamlining
it (MLOps) isn’t a one-size-fits-all process.
Organizations are using a mix of different tools to
build their own best-of-breed platforms to support
their needs (and those of their users).

The fact that the state of AI infrastructure appears
to be somewhat chaotic, with an abundance of tools
and no real best practices, is also testament to the
growing need among organizations for multiple
platforms to meet their various AI development
needs, giving rise to new technologies, and new
types of users and applications. But this could also
overwhelm infrastructure resources, so the more
GPUs companies have, the more urgent the need
for a unified compute layer that supports all these
different tools, to make sure that the volume of
resources and the way they are accessed are aligned
to specific end users and the different tools they need.

Figure 2: AI/ML stack architecture.
All respondents: All built in-house 11%; Mix of in-house and 3rd-party platforms 47%; Multiple 3rd-party platforms 33%; Only one 3rd-party platform 9%
< 50 GPUs: Mix of in-house and 3rd-party platforms 47%; Multiple 3rd-party platforms 29%; remaining two options 8% and 17%
51-100 GPUs: Mix of in-house and 3rd-party platforms 50%; Multiple 3rd-party platforms 30%; remaining two options 8% and 12%
100+ GPUs: Mix of in-house and 3rd-party platforms 39%; Multiple 3rd-party platforms 50%; remaining two options 7% and 3%
Tools Used to Optimize GPU Allocation Between Users

Figure 3: Tools used to optimize GPU allocation between users
Open-source tools: 36%
Home-grown: 23%
HPC tools like Slurm or LSF: 18%
Excel spreadsheets / Whiteboard: 14%
Run:ai: 7%
Something else: 3%
Don't use tools to manage GPU allocation: 0.8%
99% using tools

According to the survey results, nearly everyone
(99% of respondents) is using tools to optimize GPU
allocation between users, with open-source being
the most popular choice (36%), followed by
home-grown (23%).

Interestingly, the top two tools are also the most
challenging for organizations when taking AI into
production, because they are both very brittle,
indicating there is room for more than half (59%) of
companies to move to a more professional way of
optimizing GPU allocation between users.

The fact that 73% are using open-source tools,
home-grown tools or Excel sheets also suggests that
organizations are facing a lot of issues allocating GPU
resources, so it appears that despite plenty of options,
there is still no clear or definitive way to optimize.

With the majority of respondents still using tools
that are not enterprise-grade, these tools will require
attention, especially as their organizations scale and
need to take more AI into production.
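
The aggregate percentages quoted in this section (59%, 73% and 99%) follow directly from the individual shares in Figure 3. The short sketch below reproduces that arithmetic; the variable and function names are illustrative only and are not part of the survey.

```python
# Shares of respondents per tool category, as published in Figure 3 (percent).
# The published shares add up to slightly more than 100, presumably because
# of rounding.
ALLOCATION_TOOLS = {
    "Open-source tools": 36.0,
    "Home-grown": 23.0,
    "HPC tools like Slurm or LSF": 18.0,
    "Excel spreadsheets / Whiteboard": 14.0,
    "Run:ai": 7.0,
    "Something else": 3.0,
    "Don't use tools to manage GPU allocation": 0.8,
}


def combined_share(shares: dict, labels: list) -> float:
    """Sum the shares of the given categories."""
    return sum(shares[label] for label in labels)


# The two most popular options: open-source and home-grown tools.
top_two = combined_share(ALLOCATION_TOOLS, ["Open-source tools", "Home-grown"])

# Open-source, home-grown and spreadsheet/whiteboard approaches together.
non_enterprise = combined_share(
    ALLOCATION_TOOLS,
    ["Open-source tools", "Home-grown", "Excel spreadsheets / Whiteboard"],
)

# Everyone except the respondents who use no tool at all.
using_tools = 100.0 - ALLOCATION_TOOLS["Don't use tools to manage GPU allocation"]

print(f"Top two tools: {top_two:.0f}%")
print(f"Open-source, home-grown or spreadsheets: {non_enterprise:.0f}%")
print(f"Using some tool: {using_tools:.0f}%")
```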
Plan to Grow GPU Capacity in the Next 12 Months

It’s interesting to see that the vast majority of
companies (91%) are planning to grow their GPU
capacity or other AI infrastructure by an average of
23% in the next 12 months, despite the uncertainty
surrounding the current economic climate.

Organizations won’t invest in AI unless they can
actually get value out of it, so this result shows they’re
still definitely seeing value in continued investment
in AI.

Figure 4: Plan to grow GPU capacity in the next 12 months.
No increase: 9%
1-25% increase: 50%
26-50% increase: 39%
51%+ increase: 3%
91% plan to increase
Weighted average: increase by 23%
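
A weighted average such as the 23% reported in Figure 4 can be approximated from the bucketed answers. The report does not say which representative value was used for each bucket, so the sketch below assumes bucket midpoints and an arbitrary 60% for the open-ended "51%+" bucket; those values, and all names in the code, are assumptions made for illustration.

```python
# (bucket label, share of respondents in %, assumed representative growth in %)
GROWTH_BUCKETS = [
    ("No increase", 9, 0.0),
    ("1-25% increase", 50, 13.0),    # assumption: bucket midpoint
    ("26-50% increase", 39, 38.0),   # assumption: bucket midpoint
    ("51%+ increase", 3, 60.0),      # assumption: open-ended bucket, value chosen arbitrarily
]


def weighted_average(buckets) -> float:
    """Average of the representative values, weighted by respondent share.

    Dividing by the total share normalizes away the rounding in the
    published percentages, which add up to 101 rather than 100.
    """
    total_share = sum(share for _, share, _ in buckets)
    weighted_sum = sum(share * value for _, share, value in buckets)
    return weighted_sum / total_share


estimated_growth = weighted_average(GROWTH_BUCKETS)

# Share of respondents in any bucket other than "No increase". The rounded
# bucket shares give 92, one point above the 91% quoted in the report.
planning_to_grow = sum(
    share for label, share, _ in GROWTH_BUCKETS if label != "No increase"
)

print(f"Estimated average planned growth: {estimated_growth:.1f}%")
print(f"Respondents planning to grow: {planning_to_grow}%")
```

With these assumed bucket values the estimate lands close to the published 23%, but a different choice for the open-ended bucket would shift the result slightly.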
Aspects of AI Infrastructure Planned for Implementation
(within 6-12 months)

When asked what aspects of AI infrastructure they
are looking to implement in the next 6-12 months,
respondents indicated that the top aspects relate
to challenges in production, including monitoring,
observability, and explainability (50%), model
deployment and serving (44%), and orchestration and
pipelines (34%). Their answers appear to be focused
more on AI models in production and less on model
development.

Even the aspect least indicated for planned
implementation was mentioned by 19% of
respondents, which is still a respectable number.
With all of the options provided indicated as
important, it shows that there’s “a lot going on”:
there are no one or two glaringly obvious priorities,
but rather multiple aspects of AI infrastructure that
companies need to focus on to get their AI into
production faster. Many of these aspects involve
MLOps, which is interesting, because each aspect
requires different tools, and therefore a solid
computing platform to plug all of these different
tools into.

Figure 5: Aspects of AI infrastructure planned for
implementation (within 6-12 months).
Monitoring, observability and explainability: 50%
Model deployment and serving: 44%
Orchestration and pipelines: 34%
Data versioning and lineage: 33%
Feature stores: 29%
Distributed training: 27%
Synthetic data: 19%
Measurement of AI/ML Infrastructure Success

When asked how they are measuring the success
of their AI/ML infrastructure, 28% of respondents
said they are monitoring their reach to new
customers, 21% are measuring their increase in
revenue, and 17% measure success by how much time
they are saving.

The top measures of success indicated by
respondents represent benefits that are both internal
and external (both to the companies and their end
users), and clearly demonstrate that they are investing
in AI because they want to create value, which is
especially important during an uncertain economic
climate.

Figure 6: Measurement of AI/ML infrastructure success.
Reaching new customers: 28%
Increasing our revenue: 21%
Saving time: 17%
Saving money: 13%
Better serving our existing customers: 11%
Improving our products and services: 9%
Inspiring the creation of new products and services: 2%

Tools Used to Monitor GPU Cluster Utilization

The top tools used to monitor GPU cluster utilization
are NVIDIA-SMI (86%), GCP-GPU-utilization-metrics
(53%), and ngputop (49%).

It’s very hard to visualize problems with utilization, and
therefore very important for companies to get insight
into how they are utilizing their GPUs.

But with most respondents relying on these tools,
it appears to be just as difficult to find the right tool
for understanding GPU utilization, because they all
show metrics from one platform, or even one host,
and don’t provide a broad overview (or ‘the optimal
view’) of an organization’s GPU utilization across
its entire infrastructure. They only provide a current
‘snapshot’ of a small subset of the infrastructure.

Figure 7: Tools used to monitor GPU cluster utilization.
NVIDIA-SMI: 86%
GCP-GPU-utilization metrics: 53%
NGPUtop: 52%
Nvtop: 49%
Run:ai's Rntop: 35%
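
To make the single-host limitation concrete, the sketch below combines per-host `nvidia-smi` snapshots into one cluster-wide summary. It is an illustration under stated assumptions, not a description of any tool named in the survey: the host names and sample readings are hypothetical, how the snapshots are gathered from remote hosts (SSH, an agent, a scheduler) is left out, and the result is still only a point-in-time view.

```python
import csv
import io
import subprocess
from statistics import mean
from typing import Dict, List

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def query_local_gpus() -> str:
    """Return nvidia-smi CSV output for the GPUs on this host only."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_snapshot(csv_text: str) -> List[dict]:
    """Parse one host's CSV snapshot into a list of per-GPU readings."""
    gpus = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        index, util, mem_used, mem_total = (field.strip() for field in record)
        gpus.append(
            {
                "index": int(index),
                "util_pct": float(util),
                "mem_used_mib": float(mem_used),
                "mem_total_mib": float(mem_total),
            }
        )
    return gpus


def cluster_summary(snapshots: Dict[str, str]) -> dict:
    """Combine per-host snapshots (host name -> CSV text) into one summary."""
    per_host = {host: parse_snapshot(text) for host, text in snapshots.items()}
    all_gpus = [gpu for gpus in per_host.values() for gpu in gpus]
    return {
        "gpu_count": len(all_gpus),
        "mean_util_pct": mean(gpu["util_pct"] for gpu in all_gpus),
        "idle_gpus": sum(1 for gpu in all_gpus if gpu["util_pct"] == 0),
        "per_host_mean_util_pct": {
            host: mean(gpu["util_pct"] for gpu in gpus)
            for host, gpus in per_host.items()
            if gpus
        },
    }


# Hypothetical snapshots from two hosts, in the format produced by
# query_local_gpus(); the values are made up for this example.
SAMPLE_SNAPSHOTS = {
    "node-a": "0, 95, 30000, 40960\n1, 0, 10, 40960\n",
    "node-b": "0, 45, 12000, 24576\n",
}

summary = cluster_summary(SAMPLE_SNAPSHOTS)
print(summary)
```

In this made-up sample, `node-a` averages 47.5% utilization while one of its two GPUs sits idle, which is the kind of detail a per-host view shows but a cluster-wide average hides, and vice versa.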
On-Demand Access to GPU Compute

When asked about availability of on-demand access
to GPU compute, only 28% of the respondents said
they have on-demand access (Figure 8).

When asked how GPUs are assigned when not
available on demand, 51% of the respondents
indicated they are using a ticketing system
(Figure 9), suggesting that there’s still no good
on-demand access. It’s no wonder, therefore, that
89% of respondents face allocation issues regularly
(Figure 10), even though some of them (58%) claim
to have somewhat automatic access.

Only 11% said they rarely run into GPU allocation
issues, 13% have allocation issues daily, and 40% face
allocation issues weekly.

Figure 8: Availability of on-demand access to GPU compute.
Automatically accessed: 28%
Somewhat automatically accessed: 58%
No automatic access: 14%

Figure 9: GPU assignment without on-demand access.
Assigned by ticketing system: 51%
Assigned by manual request or assigned statically to specific users/jobs: 31% and 18%

Figure 10: Frequency of GPU/compute resource allocation issues.
Daily: 13%
Weekly: 40%
Bi-weekly: 24%
Monthly: 13%
Rarely: 11%
89% face allocation issues regularly
Plans to Move AI Applications and Infrastructure to the Cloud

51% of the respondents already have their applications
and infrastructure on the cloud, and 33% said they
were planning to move them to the cloud by the end of
2022.

Looking more closely at how cloud adoption differs
by level of automatic access to GPUs, we saw that
the highest share already on the cloud (78%) is among
companies without automatic access.

All respondents indicated that they are either already
on the cloud or planning to move to the cloud
this year or next, but according to these results,
moving to the cloud doesn’t necessarily solve their
GPU access problem, which is something that those
who haven’t moved to the cloud yet should be aware of.

Figure 11: Plans to move AI applications and infrastructure to the
cloud.
Already on cloud: 51%
This year: 33%
Next year: 16%
In 5 years: 0%
No plans: 0%

Figure 12: “Already on cloud” by access to GPU.
Automatic access: 47%
Somewhat automatic access: 46%
No automatic access: 78%
Percentage of AI/ML Models Deployed in Production

On average, 37% of AI/ML models are deployed in
production.

88% of respondents (compared with 77% in last year’s
survey) are deploying fewer than 50% of their models,
indicating that it’s even harder now to get things
into production, not just because of infrastructure
considerations but also because of business and
organizational ones.

It’s also why companies are investing so heavily in a
variety of different tools (as opposed to just one) to
get more AI into production, but as these results show,
there’s still a lot of room for improvement.

Figure 13: Percentage of AI/ML models deployed in production.
<10%: 0%
10-24%: 21%
25-39%: 32%
40-49%: 36%
50-74%: 12%
75-90%: 0%
90%+: 0%
88% deploying < 50%
Weighted average: 37% of AI/ML models
Main Impediments to Model Deployment

The main impediments to actually deploying include
scalability (47%), performance (46%), technical
(45%), and resources (42%). The fact that they were
all mentioned as a “main impediment” by a fairly
substantial portion of the respondents shows once
again that the obstacle to model deployment is
multi-faceted rather than a single glaring impediment.

Figure 14: Main impediments to model deployment.
Scalability: 47%
Performance: 46%
Technical: 45%
Resources: 42%
Approval to deploy: 23%
Privacy / legal issue: 8%
Demographics

Country, Department, Role, Job Seniority

Figure 15: Country (United States, United Kingdom, Germany, France)

Figure 16: Department (R&D / Engineering, Data / Data Science, IT, DevOps)

Figure 17: Role
IT / Infrastructure Architect: 35%
Data Scientist / Researcher: 33%
MLOps / DevOps: 25%
Head of an AI / Deep Learning Initiative: 5%
Platforms Engineer: 3%
CTO: 1%

Figure 18: Job Seniority (C-Suite, VP / Head, Director, Manager, Team Member, Analyst)

Company Size, GPU Farm Size

Figure 19: Company size, by number of employees (<200, 201-500, 501-1K, 1,001-5K, 5,001-10K, >10K)

Figure 20: GPU farm size (<10 GPUs, 11-50, 51-100, 100+)
About Run:ai

Run:ai's Atlas Platform brings cloud-like simplicity to
AI resource management, providing researchers
with on-demand access to pooled resources for any
AI workload. An innovative cloud-native operating
system, which includes a workload-aware scheduler
and an abstraction layer, helps IT simplify AI
implementation, increase team productivity, and
gain full utilization of expensive GPUs. Using Run:ai,
companies streamline development, management,
and scaling of AI applications across any
infrastructure, including on-premises, edge and cloud.

For more information please visit us:
https://www.run.ai/