The 2023 State of AI Infrastructure Survey

eBook | Jan 2023

www.run.ai

The 2023 Sae of

AI Infrasrucure Surve

The 2023 State of AI Infrastructure Survey

eBook | Jan 2023

Table of Contents

Introduction and Key Findings

Survey Report Findings

Challenges for AI Development

AI/ML Stack Architecture

Tools Used to Optimize GPU Allocation Between Users

Plan to Grow GPU Capacity in the Next 12 Months

Aspects of AI Infrastructure Planned for Implementation (within 6-12 months)

Measurement of AI/ML Infrastructure Success

Tools Used to Monitor GPU Cluster Utilization

On-Demand Access to GPU Compute

Plans to Move AI Applications and Infrastructure to the Cloud

Percentage of AI/ML Models Deployed in Production

Main Impediments to Model Deployment

Demographics

About Run:ai

The aicial intelligence (AI) industry has grown

rapidly in recent years, as has the need for more

advanced and scalable infrastructure to suppo

its development and deployment. The global AI

Infrastructure Market was valued at $23.50 billion in

2021 and is expected to reach $422.55 billion by 2029,

at a forecasted CAGR of 43.50% between 2022-2029.

One of the main drivers of progress in the AI

infrastructure market has been increasing awareness

among enterprises of how AI can enhance their

operational eciency, aract new business and grow

new revenue streams, while reducing costs through

the automation of process ows. Other drivers include

the adoption of sma manufacturing processes using

AI, blockchain and IoT technologies, the increased

investment by GPU/CPU manufacturers in the

development of compute-intensive chips, and the

rising popularity of chatbots, like OpenAI’s recently

launched ChatGPT, for example.

The current hype around AI has given rise to a

renewed focus on geing it into the enterprise, and

organizations are increasingly eager to sta using and

developing AI applications themselves. But with an

abundance of new AI infrastructure tools f

looding

Introduction

and Ke Findings

Introduction

the rapidly evolving industry, it is akin to a so of

technological “Wild West”, with no real best practices

for enterprises to follow as they get AI into production.

As they begin to invest more heavily in AI, there’s a lot

riding on how they decide to build their infrastructure

and service their practitioners.

This is the second ‘State of AI Infrastructure’ survey we have run. With all the new activity in the industry and the new companies entering the AI space, we’re keen to see what’s changed. We’re particularly interested in new insights into how and why organizations are building their AI infrastructure, what main challenges they face, and how the abundance of different tools has affected getting AI into production. We hope that the insights from this survey will be helpful to those who both build and use AI infrastructure.
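The market figures quoted in the introduction can be sanity-checked with the standard compound annual growth rate formula. The sketch below is illustrative only: the function name is hypothetical, and it assumes eight annual compounding periods between the 2021 valuation and the 2029 forecast, which the report does not state explicitly.

```python
def implied_cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate implied by a start value, an end value,
    and a number of annual compounding periods."""
    return (end_value / start_value) ** (1 / periods) - 1


# Figures quoted in the introduction, in USD billions.
value_2021 = 23.50
value_2029 = 422.55

# Assumption: growth compounds once per year from 2021 to 2029 (8 periods).
periods = 2029 - 2021

cagr = implied_cagr(value_2021, value_2029, periods)
print(f"Implied CAGR: {cagr:.2%}")
```

Under that assumption the implied rate comes out very close to the 43.50% forecast quoted above, so the three numbers are mutually consistent.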

Methodology

To get more insight into the current state of AI Infrastructure, we commissioned a survey of 450 Data, Engineering, AI and ML (Machine Learning) professionals from a variety of industries. The survey was administered online by Global Surveyz Research, an independent global research firm. It is based on responses from a mix of Data Scientists, Researchers, Heads of AI, Heads of Deep Learning, IT Directors, VPs IT, Systems Architects, ML Platform Engineers and MLOps, from companies across the US and Western EU ranging in size from under 200 to over 10,000 employees. The respondents were recruited through a global B2B research panel and invited via email to complete the survey, with all responses collected during the second half of 2022. The average amount of time spent on the survey was five minutes and fifty seconds. The answers to the majority of the non-numerical questions were randomized, in order to prevent order bias in the answers.

Key Findings

The more GPUs, the bigger the reliance on multiple third-party tools

As organizations scale and require more GPUs, it becomes more complex to build the right AI infrastructure to get the right amount of compute to all of the different workloads, tools, and end users. 80% of companies are now using third-party tools, and the more GPUs they require, the bigger their reliance on multiple third-party platforms, increasing from 29% reliance in companies with fewer than 50 GPUs to 50% reliance in companies with more than 100 GPUs (Figure 2). What would make more sense is a more open, middleware approach, where organizations can use different tools that run on the same infrastructure, so that they are not locked into one end-to-end platform.

Data has been overtaken by Infrastructure and Compute as the main challenges for AI development

A whopping 88% of survey respondents admitted to having AI development challenges (Figure 1), which is telling in itself. But it’s also interesting to note that Data, which was ranked by 61% of respondents in last year’s survey as their main challenge in AI development, was overtaken this year by infrastructure (i.e., the different platforms and tools that comprise “the stack”) and compute (i.e., getting access to GPU resources, not having to wait for resources, etc.), chosen by 54% and 43% of respondents respectively as their main challenges. This year, Data ranked as the third biggest challenge in AI development (41%). The fact that infrastructure and compute-related challenges are now the top concern for companies reinforces the importance of building the right foundation, for the right stack, to get the most out of their compute.

In 88% of companies, more than half of AI/ML models never make it to production

While most companies are planning to grow their GPU capacity or other AI infrastructure in the coming year, for 88% of them (compared with 77% in last year’s survey), more than half their AI/ML models don’t make it to production. On average, only 37% of AI/ML models are deployed in production environments (Figure 13). The main impediments to actually deploying (Figure 14) include scalability (47%), performance (46%), technical (45%), and resources (42%). The fact that they were all mentioned as a “main impediment” by such a substantial portion of the respondents shows that there is no single glaring impediment to model deployment; the problem is multi-faceted.

91% of companies are planning to grow their GPU capacity or other AI infrastructure in the next 12 months

The vast majority (91%) of companies are planning to grow their GPU capacity or other AI infrastructure by an average of 23% in the next 12 months (Figure 4) despite the uncertainty of the current economic climate. Organizations won’t invest in AI unless they can actually get value out of it, so this result is a resounding testament to the fact that most companies see huge potential and value in continued investment in AI.

On-demand access to GPU compute is still very low, with 89% of companies facing resource allocation issues regularly

Only 28% of the respondents have on-demand access to GPU compute (Figure 8). When asked how GPUs are assigned when not available on demand, 51% indicated they are using a ticketing system (Figure 9), suggesting that on-demand access is still lacking. So it’s no wonder that 89% of respondents face allocation issues regularly (Figure 10), even though some of them (58%) claim to have somewhat automatic access. 40% face those GPU/compute resource allocation issues weekly.

Survey Report Findings

Challenges for AI Development

When asked what their company’s main challenges are around AI development, 88% of respondents admitted to having AI development challenges. The top challenges are infrastructure-related challenges (54%), compute-related challenges (43%), and data-related challenges (41%).

It’s interesting to note that infrastructure and compute have overtaken Data as the biggest challenges. This reinforces the importance of building the right foundation, for the right stack, to get the most out of your compute.

Figure 1: Challenges for AI development
Categories: Infrastructure-related challenges; Compute-related challenges; Data-related challenges; Expense of doing AI; Training-related challenges; Defining business goals; We have no AI development challenges
Values shown: 54%, 43%, 41%, 34%, 18%, 12%

AI/ML Stack Architecture

When asked how their AI/ML infrastructure is architected, 11% of respondents said it is all built in-house, while 47% have a mix of in-house and third-party platforms. We also saw that the use of multiple third-party platforms grows with the number of GPUs (29% for those with < 50 GPUs, and up to 50% for those with 100+ GPUs). This confirms that the practice of taking AI into production and streamlining it (MLOps) isn’t a one-size-fits-all process. Organizations are using a mix of different tools to build their own best-of-breed platforms to support their needs (and those of their users).

The fact that the state of AI infrastructure appears to be somewhat chaotic, with an abundance of tools and no real best practices, is also testament to the growing need among organizations for multiple platforms to meet their various AI development needs, giving rise to new technologies, and new types of users and applications. But this could also overwhelm infrastructure resources. The more GPUs companies have, the more urgent the need for a unified compute layer that supports all these different tools, making sure that the volume of resources and the way they are accessed are aligned to specific end users and the different tools they need.

Figure 2: AI/ML stack architecture.
All respondents: All built in-house 11%; Mix of in-house and third-party platforms 47%; Multiple third-party platforms 33%; Only one third-party platform 9%
< 50 GPUs: Mix of in-house and third-party platforms 47%; Multiple third-party platforms 29%
51-100 GPUs: Mix of in-house and third-party platforms 50%; Multiple third-party platforms 30%
100+ GPUs: Mix of in-house and third-party platforms 39%; Multiple third-party platforms 50%

Tools Used to Optimize GPU Allocation Between Users

Figure 3: Tools used to optimize GPU allocation between users
Open-source tools 36%; Home-grown 23%; HPC tools like Slurm or LSF 18%; Excel spreadsheets / Whiteboard 14%; Run:ai; Something else; Don't use tools to manage GPU allocation 0.8%
99% using tools

According to the survey results, basically everyone (99% of respondents) is using tools to optimize GPU allocations between users, with open-source being the most popular choice (36%), followed by home-grown (23%).

Interestingly, the top two tools are also the most challenging for organizations when taking AI into production, because they are both very brittle, indicating there is room for more than half (59%) of companies to move to a more professional way of optimizing GPU allocations between users.

The fact that 73% are using open-source tools, home-grown tools or Excel sheets also shows that organizations are facing a lot of issues allocating GPU resources. Despite plenty of options, there is still no clear or definitive way to optimize.

With the majority of respondents still using tools that are not enterprise-grade, these tools will require attention, especially as their organizations scale and need to take more AI into production.
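The 59% and 73% totals quoted in this section can be reproduced from the individual shares reported for Figure 3. The short sketch below assumes each respondent picked a single tool category, so the shares can simply be added; the dictionary keys are illustrative names, not labels from the survey instrument.

```python
# Share of respondents per tool category, in percent, as reported for Figure 3.
tool_share = {
    "open_source": 36,
    "home_grown": 23,
    "excel_or_whiteboard": 14,
}

# "More than half (59%)": the two most popular choices combined.
top_two = tool_share["open_source"] + tool_share["home_grown"]

# "73%": open-source, home-grown and spreadsheet/whiteboard users combined.
non_enterprise = sum(tool_share.values())

print(f"Top two tools: {top_two}%")
print(f"Open-source, home-grown or Excel: {non_enterprise}%")
```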

Plan to Grow GPU Capacity in the Next 12 Months

It’s interesting to see that the vast majority of companies (91%) are planning to grow their GPU capacity or other AI infrastructure by an average of 23% in the next 12 months, despite the uncertainty surrounding the current economic climate.

Organizations won’t invest in AI unless they can actually get value out of it, so this result shows they’re still definitely seeing value in continued investment in AI.

Figure 4: Plan to grow GPU capacity in the next 12 months.
Weighted average: increase by 23%. 91% plan to increase.
Categories: 1-25% increase; 26-50% increase; 51%+ increase
Values shown: 50%, 39%

Aspects of AI Infrastructure Planned for Implementation (within 6-12 months)

When asked what aspect of AI infrastructure they are looking to implement in the next 6-12 months, respondents indicated that the top aspects relate to challenges in production, including monitoring, observability, and explainability (50%), model deployment and serving (44%), and orchestration and pipelines (34%). Their answers are focused more on AI models in production and less on model development.

Even the aspect least indicated for planned implementation was mentioned by 19% of respondents, which is still a respectable number. With all of the options provided indicated as important, it shows that there’s “a lot going on”: there’s no one or two glaringly obvious priorities, but rather multiple aspects of AI infrastructure that companies need to focus on to get their AI into production faster. Many of these aspects involve MLOps, which is interesting, because each aspect requires different tools, and therefore a solid computing platform to plug all of these different tools into.

Figure 5: Aspects of AI infrastructure planned for implementation (within 6-12 months).
Monitoring, observability and explainability 50%; Model deployment and serving 44%; Orchestration and pipelines 34%; Data versioning and lineage 33%; Feature stores 29%; Distributed training 27%; Synthetic data 19%

Measurement of AI/ML Infrastructure Success

When asked how they are measuring the success of their AI/ML infrastructure, 28% of respondents said they are monitoring their reach to new customers, 21% are measuring their increase in revenue, and 17% measure success by how much time they are saving.

The top measures of success indicated by respondents represent benefits that are both internal and external (both to the companies and their end users), and clearly demonstrate that they are investing in AI because they want to create value, which is especially important during an uncertain economic climate.

Figure 6: Measurement of AI/ML infrastructure success.
Reaching new customers 28%; Increasing our revenue 21%; Saving time 17%
Other categories: Saving money; Better serving our existing customers; Improving our products and services; Inspiring the creation of new products and services
Other values shown: 13%, 11%

Tools Used to Monitor GPU Cluster Utilization

The top tools used to monitor GPU cluster utilization are NVIDIA-SMI (86%), GCP-GPU-utilization-metrics (53%), and ngputop (49%).

It’s very hard to visualize problems with utilization, and therefore very important for companies to get insight into how they are utilizing their GPUs.

But with most respondents relying on these tools, it appears to be just as difficult to find the right tool to better understand GPU utilization, because they all show metrics from one platform, or even one host, and don’t provide a broad overview (or ‘the optimal view’) of an organization’s GPU utilization across its entire infrastructure. They only provide a ‘current snapshot’ of a small subset of the infrastructure.

Figure 7: Tools used to monitor GPU cluster utilization.
NVIDIA-SMI 86%; GCP-GPU-utilization metrics 53%
Other tools: NGPUtop; Nvtop; Run:ai's Rntop
Other values shown: 52%, 49%, 35%
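To illustrate the "one host, one snapshot" limitation described for the monitoring tools, here is a minimal sketch of what combining per-host output into a broader view could look like. It assumes each host's text was produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` and gathered by some mechanism that is outside this example. The function names, host names and sample readings are hypothetical, and this is not how any of the surveyed tools work internally.

```python
import csv
import io
from statistics import mean


def parse_snapshot(csv_text: str) -> list[dict]:
    """Parse one host's CSV snapshot: one row per GPU with
    index, utilization (%), memory used (MiB), memory total (MiB)."""
    gpus = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, util, mem_used, mem_total = (field.strip() for field in record)
        gpus.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(mem_used),
            "mem_total_mib": float(mem_total),
        })
    return gpus


def cluster_summary(snapshots: dict[str, str]) -> dict:
    """Combine per-host snapshots into a single point-in-time summary."""
    all_gpus = [gpu for text in snapshots.values() for gpu in parse_snapshot(text)]
    return {
        "gpu_count": len(all_gpus),
        "mean_util_pct": mean(gpu["util_pct"] for gpu in all_gpus),
        "idle_gpus": sum(1 for gpu in all_gpus if gpu["util_pct"] == 0),
    }


# Hypothetical sample readings from two hosts (not survey data).
sample = {
    "node-a": "0, 95, 30000, 40960\n1, 0, 10, 40960\n",
    "node-b": "0, 50, 20000, 40960\n",
}
summary = cluster_summary(sample)
print(summary)
```

Even combined this way, the result is still a single point-in-time reading; tracking utilization over time would require collecting and storing such snapshots repeatedly.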

On-Demand Access to GPU Compute

When asked about the availability of on-demand access to GPU compute, only 28% of the respondents said they have on-demand access (Figure 8).

When asked how GPUs are assigned when not available on demand, 51% of the respondents indicated they are using a ticketing system (Figure 9), suggesting that good on-demand access is still lacking. It’s no wonder, therefore, that 89% of respondents face allocation issues regularly (Figure 10), even though some of them (58%) claim to have somewhat automatic access.

Only 11% said they rarely run into GPU allocation issues, 13% have allocation issues daily, and 40% face allocation issues weekly.

Figure 8: Availability of on-demand access to GPU compute.
Automatically accessed 28%; Somewhat automatically accessed 58%; No automatic access 14%

Figure 9: GPU assignment without on-demand access.
Assigned by ticketing system 51%
Other categories: Assigned by manual request; Assigned statically to specific users/jobs
Other values shown: 31%, 18%

Figure 10: Frequency of GPU/compute resource allocation issues.
89% face allocation issues regularly
Daily 13%; Weekly 40%; Bi-weekly 24%; Monthly 13%; Rarely 11%

Plans to Move AI Applications and Infrastructure to the Cloud

51% of the respondents already have their applications and infrastructure on the cloud, and 33% said they were planning to move them to the cloud by the end of 2022.

When we looked more closely at how those already in the cloud differ based on their level of automatic access to GPUs, we saw that of the 51% of companies already on the cloud, the highest level of adoption (78%) is among companies without automatic access.

All respondents indicated that they are either already on the cloud or planning to move to the cloud this year or next. According to these results, however, moving to the cloud doesn’t necessarily solve their GPU access problem, which is something that those who haven’t moved yet should be aware of.

Figure 11: Plans to move AI applications and infrastructure to the cloud.
Already on cloud 51%; This year 33%; Next year 16%; In 5 years 0%; No plans 0%

Figure 12: “Already on cloud” by access to GPU.
Automatically accessed 47%; Somewhat automatically accessed 46%; No automatic access 78%

Percentage of AI/ML Models Deployed in Production

On average, 37% of AI/ML models are deployed in production.

88% of respondents (compared with 77% in last year’s survey) are deploying fewer than 50% of their models, indicating that it’s even harder now to get things into production, not just because of infrastructure considerations but also because of business and organizational ones.

It’s also why companies are investing so heavily in a variety of different tools (as opposed to just one) to get more AI into production, but as these results show, there’s still a lot of room for improvement.

Figure 13: Percentage of AI/ML models deployed in production.
Weighted average: 37% of AI/ML models. 88% deploying < 50%.
Categories: <10%; 10-24%; 25-39%; 40-49%; 50-74%; 75-90%; 90%+
Values shown: 21%, 32%, 36%, 12%, 0%, 0%, 0%

Main Impediments to Model Deployment

The main impediments to actually deploying include scalability (47%), performance (46%), technical (45%), and resources (42%). The fact that they were all mentioned as a “main impediment” by a fairly substantial portion of the respondents shows once again that there is no single glaring impediment to model deployment; the problem is multi-faceted.

Figure 14: Main impediments to model deployment.
Scalability 47%; Performance 46%; Technical 45%; Resources 42%
Other categories: Approval to deploy; Privacy / legal issue
Other value shown: 23%

Demographics

Country, Department, Role, Job Seniority

Figure 15: Country
Categories: United States; United Kingdom; Germany; France
Values shown: 18%, 16%, 50%

Figure 16: Department
Categories: R&D / Engineering; Data / Data Science; DevOps
Values shown: 26%, 13%, 24%, 37%

Figure 17: Role
Categories: IT / Infrastructure Architect; Data Scientist / Researcher; MLOps / DevOps; Head of an AI / Deep Learning Initiative; Platforms Engineer; CTO
Values shown: 35%, 33%, 25%

Figure 18: Job Seniority
Categories: C-Suite; VP / Head; Director; Manager; Team Member; Analyst
Values shown: 31%, 17%, 35%

Company Size, GPU Farm Size

Figure 19: Company size (number of people)
Categories: <200; 201-500; 501-1K; 1,001-5K; 5,001-10K; >10K

Figure 20: GPU farm size
Categories: <10 GPUs; 11-50; 51-100; 100+

Values shown across Figures 19 and 20: 42%, 20%, 40%, 42%, 16%, 22%

About Run:ai

Run:ai's Atlas Platform brings cloud-like simplicity to AI resource management, providing researchers with on-demand access to pooled resources for any AI workload. An innovative cloud-native operating system (which includes a workload-aware scheduler and an abstraction layer) helps IT simplify AI implementation, increase team productivity, and gain full utilization of expensive GPUs. Using Run:ai, companies streamline development, management, and scaling of AI applications across any infrastructure, including on-premises, edge and cloud.

For more information please visit us: https://www.run.ai/