Server sizing requirements

Objectives

The objective of this document is to provide prescriptive guidance for Tachyon Platform customers implementing the solution, either on on-premises servers or on the Amazon AWS or Microsoft Azure cloud platforms.

There are many possible hardware servers, cloud instance types and storage options that customers could choose from. The aim of this document is to recommend, in detail, the best-performing configuration that customers of varying client counts should use to implement the Tachyon Platform.

This document focuses heavily on the SQL Server implementation and its required storage, as this is the single most performance-demanding part of the Tachyon Platform.

The default Tachyon Platform configuration assumes, and recommends, a single Tachyon Platform server and a separate remote SQL Server in a standard two-server setup. The additional and complex configurations section covers optional scenarios: separate Switch and Background Channel servers (DMZ), and the extreme scenario where multiple Tachyon Platform servers are deployed in a split Master Stack and Response Stack configuration, which larger customers may require for reasons of scale and higher availability.

Assumptions

The key assumption in this document for sizing guidelines is that the customer is using all components of the Tachyon Platform, that is, Tachyon Instructions, Guaranteed State, Experience, Patch Success, SLA Inventory and Nomad Content Distribution.

The sizing guidelines are based on extensive testing of all the Tachyon components at the different client count scales, using representative test data to simulate typical customer environments and their administrative and reporting usage.

Therefore, the following sizing guidelines should be used as the minimum hardware requirements for the specific numbers of managed clients in the respective size environments.

In terms of CPUs, the test environment used Intel Xeon E5-2687 v3 @ 3.1 GHz, though any server-class CPU from 2016 or later should meet the performance needs for the cores specified in the following tables. Additional, newer or faster CPUs will improve performance, and the Tachyon platform and SQL Server will utilize all the CPU resources available.

In terms of data sizing, actual customer data and individual requirements will vary greatly. The data sizing is provided only as general guidance and assumes that some storage will remain as free space most of the time, when not at peak loads.

Performance load modelling

To simulate hundreds of thousands of Tachyon Agents, 1E has developed a load generation tool (loadgen) that maintains the same number of persistent connections, and can respond with the same data, as real-world clients would.

In practice, loadgen generates more data and creates greater data storms at the Tachyon platform than real Agents would, since real Agents are often offline or respond with some latency. Its results can therefore be considered a more extreme, worst-case performance load on the Tachyon platform.

The table below details data and performance loads used to create the sizing guide.

Tachyon Component	Feature	Assumptions
Explorer	Instructions	An average of 1,000 instructions per day (including Synchronous instructions). Peak loads of 10 simultaneous small and some large instructions, targeted at both small and large device sets.
Guaranteed State	Policies	Around 20 deployed policies totaling 300 active rules and remediations. An average of up to 100 rule state changes per device per day.
Experience	Events	Each device reports an average of 300 Device and Software Performance, User Interactions and Windows events per day.
SLA Inventory	Patch Success Software Inventory	At least two connectors are configured (SCCM or BigFix for Software inventory) and a Tachyon connector to gather patch data. An average of 200 software installations and 100 patches per device and up to 50,000 distinct software products, per 50k devices.
Nomad Content Delivery	Downloads	For every 50,000 clients there are 100 sites containing around 2,500 distinct subnets. At peak loads (that is, during a critical patch deployment) there could be up to 20,000 content registrations, content download requests, and responses, per minute.

An overall assumption is that long duration and high impact batch processing operations, such as SLA inventory sync consolidations and Experience cube processing, are run primarily out of business hours (typically overnight) when there would be little potential impact on other Tachyon platform traffic and administrative interactive query and analytical reporting.

Using AI Powered Auto Curation with SLA Inventory Sync

If this feature is enabled, it requires additional memory during inventory processing, based on the total number of distinct software titles found in the customer environment. Please refer to the AI Powered Auto-curation page, which explains how to calculate the memory requirements. Depending on the amount of additional memory needed for AI Auto-curation, it may be necessary to increase the memory specified for the Tachyon Platform server in the following sizing tables in order to use this feature.

On-premises installation

1E recommends installing Tachyon on a dedicated server, with a separate dedicated SQL Server. Tachyon can be installed on either a virtual server or physical hardware.

For every 50,000 clients, Tachyon requires a separate instance of the Tachyon Switch component, and each Switch instance requires a dedicated Network Interface (NIC). In a virtualized environment, this should be a dedicated vNIC that preferably maps to a dedicated physical NIC on the host.

Tachyon is a database-intensive application and so requires a highly performant Microsoft SQL Server setup with fast storage. This is the most important component to size correctly. Storage could be presented locally, but the more likely scenario is that it is presented from a customer Enterprise Storage Area Network (SAN), or from cloud-based managed storage.

As per Microsoft SQL Server best practice, 1E recommends provisioning at least three separate disk volumes for SQL Data, Logs and TempDB. These volumes may be made up of multiple dedicated disks, striped in a RAID volume for both resilience and performance (for example RAID 10), depending on the standard operational configuration of the customer's on-premises storage sub-systems.

Data and log disk volumes should be formatted with a 64 KB allocation unit size. It is assumed the customer is using fast SSD-based storage to achieve the necessary disk throughput in MB/s and IOPS: at least 6 Gb/s SATA, but preferably 12 Gb/s SAS or NVMe drives at higher scale.

Microsoft SQL Server should be configured according to the Microsoft best practice documentation. The default installation of SQL Server 2017 or later automatically configures settings for memory, TempDB and processing parallelism. The best single index to Microsoft recommendations for SQL Server can be found at https://docs.microsoft.com/en-us/sql/relational-databases/performance/performance-center-for-sql-server-database-engine-and-azure-sql-database?view=sql-server-ver15.

In addition to SQL Server databases, Tachyon platform features use a SQL Server Analysis Services (SSAS) instance, in multi-dimensional mode, that can be co-located with the SQL Server Database installation, using the same data and log drives. The disk space required for the SSAS cubes is typically only tens of MB, so can largely be ignored in terms of disk space sizing.

SSAS settings and configuration can be left at the installation defaults, but additional best practice performance optimizations for SSAS can be found in the following Microsoft guidance at http://download.microsoft.com/download/d/2/0/d20e1c5f-72ea-4505-9f26-fef9550efd44/analysis%20services%20molap%20performance%20guide%20for%20sql%20server%202012%20and%202014.docx

Note: One setting to make over and above the SQL Server installation defaults is to set Maximum SQL Server Memory to a value that reserves some memory at the server for the Operating System itself, and any additional running processes, like SQL Server Analysis Services (SSAS).

A reasonable rule of thumb for a dedicated SQL Server configuration, as defined in the following server tables, is to set Maximum SQL Server Memory to 85% of the total memory of the server, and for SSAS to set Memory Limit Low to 90% and Memory Limit High to 95%. Both of these settings can be set in the server properties via SQL Server Management Studio.
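As a minimal sketch of the rule of thumb above, the following Python converts a server's total RAM into a candidate Maximum SQL Server Memory value in MB. The function name and the 1 GB = 1,024 MB conversion are assumptions made for illustration. The 85% share comes from the text above, and the 50% share for a combined single-server installation comes from the small server sizing notes.

```python
def max_sql_server_memory_mb(total_ram_gb: float, sql_share: float = 0.85) -> int:
    """Return a candidate 'Maximum SQL Server Memory' value in MB.

    total_ram_gb: total memory of the server in GB.
    sql_share: fraction of total memory given to SQL Server
               (0.85 for a dedicated SQL Server, 0.5 for a combined server).
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    if not 0 < sql_share <= 1:
        raise ValueError("sql_share must be in the range (0, 1]")
    # Assumption: 1 GB is treated as 1,024 MB; the result is rounded down.
    return int(total_ram_gb * 1024 * sql_share)


if __name__ == "__main__":
    for ram_gb in (32, 64, 128, 256, 512):
        value = max_sql_server_memory_mb(ram_gb)
        print(f"{ram_gb} GB RAM: set Maximum SQL Server Memory to about {value} MB")
```

The result is only a starting point; the value actually applied should follow the Microsoft guidance referenced above.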

Small server sizing (<10K clients, on-premises and cloud)

For small customers with fewer than 10,000 seats, and for Proof-of-Concept installations, all the Tachyon server components and MS SQL may be installed on a single server for simplicity. It should be noted that this may mean higher SQL per-core licensing costs than a split two-server installation.

On a single server, as per Microsoft SQL Server best practice, it is still recommended to have a minimum of three separate disks (OS disk, SQL DB and Logs, and TempDB) to gain optimum disk performance.

The following table details the VM and storage sizing for both on-premises and cloud platforms.

Platform	On-Premises	Microsoft Azure	Amazon AWS
Devices	Up to 10,000	Up to 10,000	Up to 10,000
Server Type	Physical or Virtual	Standard E4ds v4	R5.xlarge
CPU Cores	4	4	4
RAM	32 GB	32 GB	32 GB
Tachyon Switches	1	1	1
NICs	1	1	1
Disks
OS Drive	64 GB	64 GB	64 GB
DB/logs Disk Size	500 GB	500 GB	500 GB
MBs/IOPS	250/15,000	200/5,000	200/5,000
TempDB Size	150 GB	150 GB	150 GB
MBs/IOPS	250/15,000	242/38,500	200/5,000

Notes:

For a combined single-server installation, Maximum SQL Server Memory should be capped at 50% of available memory, to ensure the Tachyon application has its own dedicated pool of memory.
IOPS is calculated at standard 16k Block Size, although SQL Data volumes/disks should be formatted in Windows at 64k allocation units, according to Microsoft SQL Server best practice guidance.
For cloud platforms, to achieve the required disk throughput, the above assumes premium SSD-based storage, that is, Premium SSDs for Azure and EBS gp2 volumes for AWS, or better.

On-premises installation design patterns

The following table details the VM and storage sizing for different customer sizes.

	Medium 1	Medium 2	Large 1	Large 2	Large 3
Devices	25,000	50,000	100,000	200,000	500,000
Tachyon Server
CPU Cores	4	8	16	32	48
RAM	16 GB	32 GB	64 GB	128 GB	192 GB
Tachyon Switches	1	1	2	4	10
NICs	2	2	3	5	11
Remote SQL Server
CPU Cores	4	8	16	24	32
RAM	32 GB	64 GB	128 GB	256 GB	512 GB
Disks
DB Size	500 GB	1,000 GB	2,000 GB	4,000 GB	8,000 GB
MBs/IOPS	250/15,000	500/30,000	1,000/60,000	2,000/120,000	4,000/240,000
Logs Size	100 GB	200 GB	500 GB	1,000 GB	2,000 GB
MBs/IOPS	170/10,000	250/15,000	500/30,000	1,000/60,000	2,000/120,000
TempDB Size	150 GB	300 GB	600 GB	1,200 GB	2,000 GB
MBs/IOPS	250/15,000	500/30,000	1,000/60,000	2,000/120,000	4,000/240,000

Notes:

Customers with an intermediate seat count should choose the closest higher size. For example, a 150,000-seat implementation should be treated as "Large 2" rather than "Large 1".
IOPS is calculated at standard 16k Block Size, although SQL Volumes should have a 64k block size, and disks/volumes should be formatted in Windows at 64k allocation units, according to Microsoft best practice guidance.
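The rounding-up rule in the notes above can be sketched as a simple lookup. The device limits and size names are taken from the table above; the function name and the error handling are illustrative assumptions only.

```python
from bisect import bisect_left

# Upper device limit for each size, taken from the sizing table above.
TIERS = [
    (25_000, "Medium 1"),
    (50_000, "Medium 2"),
    (100_000, "Large 1"),
    (200_000, "Large 2"),
    (500_000, "Large 3"),
]


def sizing_tier(devices: int) -> str:
    """Return the smallest size whose device limit is >= devices."""
    if devices <= 0:
        raise ValueError("devices must be positive")
    limits = [limit for limit, _ in TIERS]
    index = bisect_left(limits, devices)
    if index == len(TIERS):
        raise ValueError("device count exceeds the largest documented size")
    return TIERS[index][1]


if __name__ == "__main__":
    for count in (25_000, 60_000, 150_000, 500_000):
        print(f"{count} devices: {sizing_tier(count)}")
```

Environments with fewer than 10,000 devices may instead use the small single-server sizing described earlier, which this sketch does not model.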

Network considerations

A server hosting a Tachyon Response Stack requires a dedicated network interface for each Switch, and one for the connection to the remote SQL Server, to keep incoming traffic from clients separate from the outgoing traffic to the Response and other Tachyon SQL databases. It is expected that this server-to-server network traffic would be carried over a 10 Gb or faster data center backbone network.
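The Switch and NIC counts in the sizing tables follow from two rules stated in this document: one Switch instance per 50,000 clients, and one NIC per Switch plus one for the remote SQL Server connection. A minimal sketch of that arithmetic follows; the function names are hypothetical.

```python
import math

CLIENTS_PER_SWITCH = 50_000  # one Switch instance per 50,000 clients


def switches_required(clients: int) -> int:
    """Number of Tachyon Switch instances needed for a client count."""
    if clients <= 0:
        raise ValueError("clients must be positive")
    return math.ceil(clients / CLIENTS_PER_SWITCH)


def nics_required(clients: int, remote_sql: bool = True) -> int:
    """One NIC per Switch, plus one for the remote SQL Server link.

    remote_sql=False models the combined single-server installation,
    where there is no separate SQL Server to connect to.
    """
    return switches_required(clients) + (1 if remote_sql else 0)


if __name__ == "__main__":
    for clients in (25_000, 50_000, 100_000, 200_000, 500_000):
        print(clients, switches_required(clients), nics_required(clients))
```

For the five documented sizes this reproduces the Tachyon Switches and NICs rows of the table above.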

Accelerated Networking, which provides enhanced NIC performance using Receive Side Scaling (RSS), should be enabled on all NICs. Please refer to https://docs.microsoft.com/en-us/windows-hardware/drivers/network/introduction-to-receive-side-scaling.

Also, increasing the transmit (TX) and receive (RX) buffer sizes to their maximum, under the Windows network adaptor advanced properties, improves performance.

Microsoft Azure

Out of scope

This document only focuses on Tachyon Server and SQL Server on individual Azure VMs and configurations using Azure Premium storage.

It doesn't consider using Microsoft SQL as part of Azure Platform as a Service (PAAS) offerings, either SQL Server Managed Instances or Native Azure SQL (https://docs.microsoft.com/en-us/azure/azure-sql/azure-sql-iaas-vs-paas-what-is-overview) as these solutions are not currently supported by 1E as a means to implement SQL Server for the 1E Tachyon Platform.

In addition, it does not consider Azure instances that do not rely on Azure Premium storage but have local NVMe drives, like the Lsv2-series. Although these instance types have very high performing storage and data transfer bandwidth, the NVMe disks are ephemeral (non-persistent), so on a single SQL Server instance they are only practical for TempDB.

In the future, 1E plan to provide guidance in using non-persistent storage solutions or SQL Business Critical Managed Instances, as part of a SQL Server Always on Availability Group cluster, that would provide storage resilience and redundancy as described in https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server?view=sql-server-ver15. However, since these solutions require a cluster of minimum 3 nodes, it should be noted that these solutions would be much more expensive to implement than individual VMs.

Azure VM Selection

This document should be read in conjunction with the Azure documentation, especially the guidance on maximizing Microsoft SQL Server performance on Azure VMs: https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices.

Azure Premium storage recommendations for SQL Server workloads are also detailed more completely in https://docs.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance#optimize-IOPS-throughput-and-latency-at-a-glance.

Based on these factors, 1E recommend using Azure Dsv4 VMs for the Tachyon Platform Server and Edsv4-series VMs for SQL Server, to achieve the optimum ratio of vCPU and memory count for the separate requirements.

Dsv4 and Edsv4-series sizes run on the Intel® Xeon® Platinum 8000 series (Cascade Lake) processors. In addition, the Edsv4 virtual machine sizes feature up to 504 GiB of RAM, in addition to fast and large local SSD storage (up to 2,400 GiB). These virtual machines are ideal for memory-intensive enterprise applications and applications that benefit from low latency, high-speed local storage in the following specifications.

Dsv4 Series
Size	vCPU	Memory: GiB	Temp storage (SSD) GiB	Max data disks	Max uncached disk throughput: IOPS/MBps	Max NICs	Expected Network bandwidth (Mbps)
Standard_D2s_v4	2	8	0	4	3200/48	2	1000
Standard_D4s_v4	4	16	0	8	6400/96	2	2000
Standard_D8s_v4	8	32	0	16	12800/192	4	4000
Standard_D16s_v4	16	64	0	32	25600/384	8	8000
Standard_D32s_v4	32	128	0	32	51200/768	8	16000
Standard_D48s_v4	48	192	0	32	76800/1152	8	24000
Standard_D64s_v4	64	256	0	32	80000/1200	8	30000

Azure VM constrained core CPU options

At some VM sizes, it is possible to reduce the CPU count to therefore lower SQL license requirements, whilst maintaining the higher storage throughput of the VM https://docs.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json.

Therefore, for the Large 3 sizing, higher specification constrained core SQL VMs are chosen, specifically to get the higher required storage throughput, though with the same VM and OS pricing, but with half the actual presented vCPU core count to minimize SQL license costs.

Azure Premium storage selection

For any solution based on Microsoft SQL Server, storage throughput is normally the major bottleneck to performance, and the Tachyon Platform is no exception to this.

The maximum storage throughput in terms of MB/s and IOPS for specific Azure EdsV4 VMs is detailed in the table below, and this will determine the maximum equivalent MBs/IOPS for the selected Azure Premium storage volumes.

For example, if you attach a volume made up of two P30 disks (200 MB/s provisioned throughput each) to an E16ds_v4 VM, you reach the instance limit of 384 MB/s before you reach the volume limit of 400 MB/s total throughput.

Edsv4 Series
VM Size	Temp storage (SSD) GiB	Max data disks	Max cached and temp storage throughput: IOPS/MBs (cache size in GiB)	Max un-cached disk throughput: IOPS/MBs
Standard_E2ds_v4	75	4	19000/120(50)	3200/48
Standard_E4ds_v4	150	8	38500/242(100)	6400/96
Standard_E8ds_v4	300	16	77000/485(200)	12800/192
Standard_E16ds_v4	600	32	154000/968(400)	25600/384
Standard_E20ds_v4	750	32	193000/1211(500)	32000/480
Standard_E32ds_v4	1200	32	308000/1936(800)	51200/768
Standard_E48ds_v4	1800	32	462000/2904(1200)	76800/1152
Standard_E64ds_v4	2400	32	615000/3872(1600)	80000/1200
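The instance-limit example above can be sketched as taking the minimum of the combined disk throughput and the VM's uncached throughput limit. The function name is hypothetical; the 200 MB/s per P30 disk and the 384 MB/s E16ds_v4 limit are the figures quoted in this section.

```python
def effective_throughput_mb_s(disk_throughputs_mb_s, vm_limit_mb_s):
    """Return (effective MB/s, limiting factor) for a striped volume.

    disk_throughputs_mb_s: provisioned MB/s of each disk in the stripe.
    vm_limit_mb_s: maximum uncached disk throughput of the VM size.
    """
    volume_total = sum(disk_throughputs_mb_s)
    if volume_total > vm_limit_mb_s:
        return vm_limit_mb_s, "vm"
    return volume_total, "volume"


if __name__ == "__main__":
    # Two P30 disks (200 MB/s each) attached to an E16ds_v4 (384 MB/s limit).
    throughput, limited_by = effective_throughput_mb_s([200, 200], 384)
    print(f"Effective throughput: {throughput} MB/s, limited by the {limited_by}")
```

When the limiting factor is the VM, adding more disks to the stripe does not raise throughput; a larger VM size is needed instead.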

The recommended Azure Premium storage type is Premium SSD, as these disks deliver high-performance, low-latency disk support for virtual machines at the lowest storage cost. Please refer to https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json#premium-ssd.

Another option would be to use Azure Ultra disks, but these are much more expensive and do not benefit from read caching. One additional benefit of Ultra disks, however, is the ability to change the performance of the disk dynamically, without the need to restart the VM.

As per Microsoft SQL Server best practice, 1E recommends provisioning at least three separate Azure disk volumes for SQL Data, Logs and TempDB. On Azure Edsv4 virtual machines, it is possible to use the fast and large local SSD storage for TempDB, as this does not need to be persistent and is re-created automatically every time SQL Server starts.

The Data volumes may be made up from multiple Premium disks (in the higher spec Configurations), striped in a Basic Array using Windows Storage Spaces. This configuration allows for the local NVMe SSDs to act as a read-ahead cache for these striped volumes and gain better read performance. Please refer to https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices-storage.

Notes:

If the VM is created using the Azure SQL VM template, then the relevant disks will be automatically created as Windows Storage space volumes, with read-ahead caching enabled.
Overall, these storage volumes should have the combined MB/s throughput, equal to or above the total MB/s supported by the given VM size, to gain maximum storage performance.

Azure prescriptive design patterns

The following table details the Azure VM and storage sizing for different customer sizes.

	Medium 1	Medium 2	Large 1	Large 2	Large 3
Devices	25,000	50,000	100,000	200,000	500,000
Tachyon Server
VM type	Std_D4s_v4	Std_D8s_v4	Std_D16s_v4	Std_D32s_v4	Std_D48s_v4
CPU Cores	4	8	16	32	48
RAM	16 GB	32 GB	64 GB	128 GB	192 GB
Tachyon Switches	1	1	2	4	10
NICs	2	2	3	5	11*
Remote SQL Server
VM Type	Std_E4ds_v4	Std_E8ds_v4	Std_E16ds_v4	Std_E20ds_v4	Std_E64-32ds_v4
CPU Cores	4	8	16	20	32
RAM	32 GB	64 GB	128 GB	160 GB	504 GB
Max MB/s	96	192	384	480	1,200
Max IOPS	6,400	12,800	25,600	32,000	80,000
DB Size	500 GB	1,000 GB	2,000 GB	4,000 GB	8,000 GB
MBs/IOPS	170/3,500	340/7,000	400/10,000	800/20,000	1,600/40,000
Logs Size	100 GB	200 GB	500 GB	1,000 GB	2,000 GB
MBs/IOPS	170/3,500	170/3,000	170/3,000	200/5,000	250/7,500
TempDB Size	150 GB	300 GB	600 GB	750 GB	2,000 GB
MBs/IOPS	242/38,500	485/77,000	968/154,000	1,211/193,000	3,872/615,000

Notes:

Customers with an intermediate seat count should choose the closest higher size. For example, a 150,000-seat implementation should be treated as "Large 2" rather than "Large 1".
As noted above, to reach the required throughput in MB/s, especially for the data drive, multiple P20, P30 or P40 Premium disks should be used, striped using Windows Storage Spaces to present a single volume.
The high MBs/IOPS figures for the TempDB drives reflect the built-in high throughput of the fast, local, SSD-based (non-persistent) storage that the Edsv4-series provides.

Azure network considerations

A server hosting a Tachyon Response Stack requires a dedicated network interface for each Switch, and one for the connection to the remote SQL Server instance used for the Responses database, to keep incoming traffic from clients separate from the outgoing traffic to the Responses database.

Accelerated Networking which provides enhanced NIC performance using SR-IOV should be enabled on the SQL VM Network Interface and Platform VM Network Interface that communicates with it. Detailed steps on how to configure this are given here https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-powershell.

Also increase the transmit (TX) and receive (RX) buffer sizes to their maximum under the network adaptor advanced properties.

If connecting to an Azure based Tachyon Switch via external public Azure IP addresses or Azure Load Balancers you will need to extend the default TCP timeout from 4 minutes to 15 minutes. Detailed steps on how to do this can be found here https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer.

Amazon AWS

Out of scope

This document only focuses on Amazon Elastic Compute Cloud (EC2) instances and configurations using Amazon Elastic Block Store (EBS).

It does not consider using Microsoft SQL as part of AWS Platform as a Service (PaaS) offerings such as Amazon Relational Database Service (RDS) (https://aws.amazon.com/rds/), as this is not currently supported by 1E as a means to implement SQL Server for the 1E Tachyon Platform.

In addition, it does not consider AWS instances that do not use the EBS platform but have local NVMe-based storage, such as the I3en and R5d instance types.

Although these instance types have very high performing storage and data transfer bandwidth, the NVMe disks are ephemeral (non-persistent), so for SQL Server they are only practical for TempDB.

In the future, 1E plan to provide guidance in using non-persistent storage solutions as part of a SQL Server Always on Availability Group cluster, that provides storage resilience and redundancy as described in https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server?view=sql-server-ver15. However, since these solutions require a cluster of minimum 3 nodes, it should be noted these solutions would be much more expensive to implement.

AWS instance selection

This document should be read in conjunction with the AWS documentation, especially the guidance on maximizing Microsoft SQL Server performance with Amazon EBS: https://aws.amazon.com/blogs/storage/maximizing-microsoft-sql-server-performance-with-amazon-ebs.

In recommending the AWS instance type and size, 1E has followed this guidance:

Instances are based on the latest AWS Nitro Systems https://aws.amazon.com/ec2/nitro
Amazon EBS Optimized systems were selected, so that these can have the best possible storage throughput performance in terms of MB/s and IOPS https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html

Based on these factors 1E recommend using the EC2 M5 instances for Tachyon Platform server and R5 instances for SQL Server, to get the optimum ratio of vCPU and memory count.

M5/R5 instances have 3.1 GHz Intel Xeon® Platinum 8000 series processors with the Intel Advanced Vector Extension (AVX-512) instruction set, and the following specifications.

Instance Size	vCPU	Memory (GiB)	Instance Storage(GiB)	Network Bandwidth (Gbps)	EBS Bandwidth (Mbps)
m5.large	2	8	EBS-Only	Up to 10	Up to 4,750
m5.xlarge	4	16	EBS-Only	Up to 10	Up to 4,750
m5.2xlarge	8	32	EBS-Only	Up to 10	Up to 4,750
m5.4xlarge	16	64	EBS-Only	Up to 10	4,750
m5.8xlarge	32	128	EBS-Only	10	6,800
m5.12xlarge	48	192	EBS-Only	12	9,500
m5.16xlarge	64	256	EBS-Only	20	13,600
m5.24xlarge	96	384	EBS-Only	25	19,000

AWS instance CPU options

The above table shows the default number of vCPUs provisioned for a specific AWS instance type at creation time.

Amazon EC2 instances support multithreading, which enables multiple threads to run concurrently on a single CPU core. By disabling hyper-threading, it is possible to present fewer vCPUs (and therefore lower SQL license requirements) whilst maintaining the higher storage throughput. Please refer to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html.
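As a rough illustration of the vCPU reduction described above, the sketch below derives the core count that remains when each core runs a single thread. The function name is hypothetical. The dictionary keys mirror the `CoreCount` and `ThreadsPerCore` CPU options described in the AWS page linked above, but how they are passed at launch time should be checked against that page. The default of two threads per core is an assumption about the instance type.

```python
def cpu_options_without_hyperthreading(default_vcpus: int,
                                       threads_per_core: int = 2) -> dict:
    """Return CPU options that keep every core but run one thread per core.

    default_vcpus: vCPUs the instance type presents by default.
    threads_per_core: default threads per core (assumed to be 2).
    """
    if default_vcpus <= 0 or default_vcpus % threads_per_core != 0:
        raise ValueError("default_vcpus must be a positive multiple of threads_per_core")
    return {
        "CoreCount": default_vcpus // threads_per_core,
        "ThreadsPerCore": 1,  # hyper-threading disabled
    }


if __name__ == "__main__":
    # Assumption: the instance sizes present 48 and 64 vCPUs by default,
    # as the 12xlarge and 16xlarge sizes do in the table above.
    for name, vcpus in (("12xlarge", 48), ("16xlarge", 64)):
        print(name, cpu_options_without_hyperthreading(vcpus))
```

With one thread per core, the presented vCPU count equals the core count, which is the figure that matters for SQL Server per-core licensing.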

AWS EBS storage selection

For any solution based on Microsoft SQL Server, storage throughput is normally the major bottleneck to performance, and the Tachyon Platform is no exception to this.

The maximum storage throughput in terms of MB/s and IOPS for specific AWS R5 instances is detailed in the table below, and this will determine the maximum equivalent IOPS for the selected AWS storage volumes.

For example, if you attach a single 20,000-IOPS volume to an r5.4xlarge instance, you reach the instance limit of 18,750 IOPS before you reach the volume limit of 20,000 IOPS.

Instance size	Maximum storage bandwidth (Mbps)	Maximum throughput (MB/s, 128 KiB I/O)	Maximum IOPS (16 KiB I/O)
r5.large	4,750	593.75	18,750
r5.xlarge	4,750	593.75	18,750
r5.2xlarge	4,750	593.75	18,750
r5.4xlarge	4,750	593.75	18,750
r5.8xlarge	6,800	850	30,000
r5.12xlarge	9,500	1,187.5	40,000
r5.16xlarge	13,600	1,700	60,000
r5.24xlarge	19,000	2,375	80,000
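Two pieces of arithmetic sit behind the table above: the MB/s column is the storage bandwidth in Mbps divided by 8 bits per byte, and the IOPS a volume can actually deliver is capped by the instance limit. The sketch below shows both; the function names are hypothetical.

```python
def mbps_to_mb_per_s(bandwidth_mbps: float) -> float:
    """Convert storage bandwidth in megabits/s to megabytes/s."""
    return bandwidth_mbps / 8


def effective_iops(volume_iops: int, instance_max_iops: int) -> int:
    """IOPS actually achievable: the lower of the volume and instance limits."""
    return min(volume_iops, instance_max_iops)


if __name__ == "__main__":
    # r5.4xlarge: 4,750 Mbps of EBS bandwidth and an 18,750 IOPS limit.
    print(mbps_to_mb_per_s(4_750))         # 593.75
    print(effective_iops(20_000, 18_750))  # 18750
```

Provisioning more IOPS on a volume than the instance limit allows adds cost without adding performance.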

The recommended AWS EBS storage type is Provisioned IOPS SSD (io1 and io2) volumes. These (io1 and io2) SSD volumes are designed to meet the needs of I/O-intensive workloads, particularly database workloads, that are sensitive to storage performance and consistency.

Unlike General Purpose EBS Storage (gp2), which uses a bucket and credit model to calculate performance, io1 and io2 volumes allow you to specify a consistent IOPS rate when you create volumes, and Amazon EBS delivers the provisioned performance 99.9 percent of the time. For more details see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html.

As per Microsoft SQL Server best practice, 1E recommends provisioning at least three separate AWS disk volumes for SQL Data, Logs and TempDB. These volumes may be made up of multiple io1 disks, striped in a Basic Array using Windows Storage Spaces to create a single higher-performance volume with the combined IOPS of all individual disks in the Storage Pool. For an example of this, see https://d1.awsstatic.com/whitepapers/maximizing-microsoft-sql-using-ec2-nvme-instance-store.pdf.

The configured volumes should have a combined MBs/IOPS throughput equal to or above the total MBs/IOPS supported by the given instance type, to gain the maximum storage throughput possible for that instance type.

AWS EC2 SQL prescriptive design patterns

The following table details the AWS Instance and storage sizing for different customer sizes.

	Medium 1	Medium 2	Large 1	Large 2	Large 3
Max devices	25,000	50,000	100,000	200,000	500,000
Tachyon Server
AWS Instance	M5.xlarge	M5.2xlarge	M5.4xlarge	M5.8xlarge	M5.12xlarge
CPU Cores	4	8	16	32	48
RAM	16 GB	32 GB	64 GB	128 GB	192 GB
Tachyon Switches	1	1	2	4	10
NICs	2	2	3	5	11
Remote SQL Server
AWS Instance	R5.xlarge	R5.2xlarge	R5.4xlarge	R5.12xlarge	R5.16xlarge
CPU Cores	4	8	16	24*	32*
RAM	32 GB	64 GB	128 GB	384 GB	512 GB
Max MB/s	593.75	593.75	593.75	1,187.5	1,700
Max IOPS	18,750	18,750	18,750	40,000	60,000
DB Disk Size	500 GB	1,000 GB	2,000 GB	4,000 GB	8,000 GB
DB Disk IOPS	10,000	10,000	12,000	24,000	48,000
Logs Disk Size	150 GB	300 GB	600 GB	1,200 GB	2,000 GB
Log Disk IOPS	5,000	5,000	6,000	8,000	16,000
TempDB Size	150 GB	300 GB	600 GB	1,200 GB	2,000 GB
TempDB IOPS	6,000	6,000	8,000	10,000	20,000

Notes:

Customers with an intermediate seat count should choose the closest higher size. For example, a 150,000-seat implementation should be treated as "Large 2" rather than "Large 1".
The Large 2 and Large 3 sizes (*) use a larger instance size to get the necessary storage throughput, but assume hyper-threading is disabled on the instance to reduce the vCPU count and, therefore, SQL license costs.

AWS network considerations

Enhanced networking (SR-IOV) must be enabled on both NICs, which should be the default. Detailed steps on how to verify this are given here: https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/sriov-networking.html. Also increase the transmit (TX) and receive (RX) buffer sizes to their maximum under the network adaptor advanced properties.

Additional and complex configurations

Tachyon Switch and Background Channel servers (DMZ)

In some customer environments, with network segmentation, it may be beneficial to separate the switch infrastructure servers from the Tachyon Platform component servers so that clients connect to a more network local server, whilst the platform and SQL Servers reside in a more central datacenter subnet.

This same model is true for a DMZ environment, where remote clients only connect to a Tachyon Switch and background components in an intentionally separated DMZ environment and subnet.

The Tachyon install path (DMZ installation) can be used to install only the Tachyon components required for client connectivity, that is, the Tachyon Switch and Background Channel (BGC).

Servers hosting only the Switch and Background Channel have lower memory, CPU core and storage requirements than a full Tachyon Platform server, but the following requirements should be noted:

As stated above, a new instance of the Tachyon Switch is required for every 50k clients, with a dedicated NIC of at least 1 Gbps speed. This network IP may be shared with the BGC.
To support additional client connection count numbers, for every 50k clients, a separate dedicated NIC and additional instance of the Tachyon Switch and separate IP address will be required.
An additional internal facing interface is required for outgoing response traffic from Tachyon Switch(s) on the Switch/DMZ Server to the internal Tachyon Response Stack. This also should have a minimum speed of 1Gbps, or higher if hosting multiple switch instances.

The following table details the recommended server specification, for on-premises and cloud.

Platform	On-Premises	Microsoft Azure	Amazon AWS
Devices	Up to 50,000	Up to 50,000	Up to 50,000
Server Type	Physical or Virtual	Std_D4s_v4	M5.xlarge
CPU Cores	4	4	4
RAM	16 GB	16 GB	16 GB
Tachyon Switches	1	1	1
NICs	2	2	2
Devices	Up to 100,000	Up to 100,000	Up to 100,000
Server Type	Physical or Virtual	Std_D8s_v4	M5.2xlarge
CPU Cores	8	8	8
RAM	32 GB	32 GB	32 GB
Tachyon Switches	2	2	2
NICs	3	3	3
Devices	Up to 200,000	Up to 200,000	Up to 200,000
Server Type	Physical or Virtual	Std_D16s_v4	M5.4xlarge
CPU Cores	16	16	16
RAM	64 GB	64 GB	64 GB
Tachyon Switches	4	4	4
NICs	5	5	5

Separate Master and Response Stacks

The standard Tachyon Installation assumes all components are installed on a single server, with databases hosted on a remote dedicated SQL Server.

However, it is also supported to split some of the Tachyon Platform between multiple servers - in this configuration creating a Tachyon Master Stack server (Coordinator Service, Consumer API, Experience and SLA components) and one or more Response Stack servers (Switch, Background Channel and Core component).

The Tachyon Setup installer has different installation option paths to support these installation types, installing first a "Master Stack" server and then separate "Response Stack" servers.

A split server configuration would only really be practical in large-scale environments, but it does have a couple of advantages over installing all the Tachyon components on a single server:

Spread the hardware requirements across multiple, smaller servers: If the required VM size (CPU, memory or number of NICs) is greater than the capabilities of the virtual host, then multiple smaller VMs may be more practical in some environments.

Multiple Response Stacks can provide higher throughput than a single server, but at the cost of increased total vCPU and RAM for the platform, across multiple VMs.

Provide some resilience and fault tolerance for the platform: Multiple, redundant Response Stack servers allow for the failure of a single Response server VM, in which case all clients fail over to the remaining servers and their Switch instances.

To achieve resilience, the number of Switches configured and the total capacity of the Response servers that remain after a failure must match the total load of all required client connections. This may mean at least three Response server installations are required, so that the two remaining Response servers can handle all the required active connections if one VM fails.
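As an illustration of the resilience reasoning above, the number of Response servers can be estimated as the servers needed to carry the full client load plus the number of failures to be tolerated. The following Python sketch assumes every Response server is sized identically; the function and parameter names are hypothetical.

```python
def response_servers_needed(
    total_clients: int,
    clients_per_server: int,
    tolerated_failures: int = 1,
) -> int:
    """Estimate how many identically sized Response servers to deploy.

    Assumption (not from the product documentation): all Response servers
    have the same client capacity, and the servers that survive a failure
    must carry the entire client load between them.
    """
    if total_clients <= 0 or clients_per_server <= 0:
        raise ValueError("client counts must be positive")
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures cannot be negative")
    # Servers needed to carry the full load with no failures (rounded up).
    base = -(-total_clients // clients_per_server)
    return base + tolerated_failures


# Two servers are needed to carry 100,000 clients at 50,000 each,
# plus one spare to survive a single VM failure.
print(response_servers_needed(100_000, 50_000))
```

With 100,000 clients and 50,000 clients per server, this gives three servers, which matches the three-server scenario described above.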

Redundant Response servers only provide resilience against the failure of one of the redundant Response server VMs. If the single, separate Master Stack VM (or SQL Server) fails, the entire Tachyon Platform solution will be unavailable.

Overall resilience and high availability are best provided by the standard high-availability features built into the chosen virtualization platform and storage system (on-premises) or, in the cloud, by the provider's specific high-availability and disaster recovery options.

The rule that for every 50,000 clients, Tachyon requires a separate instance of the Switch component, which requires a dedicated Network Interface (NIC), still applies for Response Servers, as they will host switch instance(s) alongside the Tachyon core component.

If deploying separate Response and Master Stack servers, it is still recommended to install all Tachyon databases on a separate, dedicated SQL Server, so as not to mix a SQL Server installation with IIS and other web server components, as per Microsoft best practice. Therefore, the VM sizing guidelines, disk throughput and storage requirements for the SQL Server VM are identical for both single and split platform server configurations. Please refer to the above sections for SQL Server sizing, using the total client count required for the environment.

Response server design patterns

With response servers, there are various configuration sizes that could be used to support the required maximum number of active clients, depending on required resilience or redundancy.

The basic formula for sizing a separate Response server is to allocate at least 8 vCPU cores and 16 GB of memory for a maximum of 50,000 active clients reporting to that server.

Response Servers
Maximum Clients	25,000	50,000	100,000	200,000
CPU Cores	4	8	16	32
RAM	8 GB	16 GB	32 GB	64 GB
Tachyon Switches	1	1	2	4
NICs	2	2	3	5

Note: When planning for resiliency, there should be enough capacity in the individual Response servers to allow at least one VM to fail while the remaining VMs still have the capacity to support the total number of required connections.
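The Response server table can be reproduced from the basic formula (8 vCPU cores and 16 GB of memory per 50,000 clients) together with the one-Switch-per-50,000-clients rule. The sketch below is illustrative only: the class and function names are hypothetical, and linear scaling for client counts between the listed sizes is an assumption rather than tested guidance.

```python
from dataclasses import dataclass

CLIENTS_PER_UNIT = 50_000  # sizing unit used by the formula in the text


@dataclass(frozen=True)
class ResponseServerSpec:
    cpu_cores: int
    ram_gb: int
    switches: int
    nics: int


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def response_server_spec(max_clients: int) -> ResponseServerSpec:
    """Scale the 8 cores / 16 GB per 50k clients formula to max_clients.

    Assumption: resources scale linearly between the sizes in the table.
    """
    if max_clients <= 0:
        raise ValueError("max_clients must be positive")
    switches = ceil_div(max_clients, CLIENTS_PER_UNIT)
    return ResponseServerSpec(
        cpu_cores=ceil_div(8 * max_clients, CLIENTS_PER_UNIT),
        ram_gb=ceil_div(16 * max_clients, CLIENTS_PER_UNIT),
        switches=switches,
        nics=switches + 1,  # one NIC per Switch plus one internal-facing NIC
    )


for clients in (25_000, 50_000, 100_000, 200_000):
    print(clients, response_server_spec(clients))
```

For the four client counts listed in the table, the sketch returns the same CPU, RAM, Switch and NIC values.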

Master Stack server design patterns

The basic formula for sizing a separate Master Stack server is to allocate at least 2 vCPU cores and 16 GB of memory for every 50,000 active clients in the environment.

The following table details the VM CPU, memory and NIC requirements for example Master Stack servers, each supporting a maximum total number of active clients in the environment.

Master Stack Servers
Maximum Clients	50,000	100,000	200,000	400,000
CPU Cores	2	4	8	16
RAM	16 GB	32 GB	64 GB	128 GB
NICs	1	1	1	1

Note: Although multiple (and potentially redundant) Response servers can together provide capacity for more clients, the total number of active clients across all the Response servers always stays the same.
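The Master Stack formula (2 vCPU cores and 16 GB of memory for every 50,000 active clients, with a single NIC) can be expressed in the same way. This is a hedged sketch with a hypothetical function name; rounding partial blocks of 50,000 clients up to the next whole block is an assumption.

```python
def master_stack_spec(total_clients: int) -> dict[str, int]:
    """Return minimum Master Stack resources for the total active clients.

    Assumption: any partial block of 50,000 clients is rounded up.
    """
    if total_clients <= 0:
        raise ValueError("total_clients must be positive")
    blocks = -(-total_clients // 50_000)  # number of 50k-client blocks, rounded up
    return {
        "cpu_cores": 2 * blocks,
        "ram_gb": 16 * blocks,
        "nics": 1,  # the table lists a single NIC at every size
    }


for clients in (50_000, 100_000, 200_000, 400_000):
    print(clients, master_stack_spec(clients))
```

The input is the total number of active clients in the environment, not the combined capacity of the Response servers.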

How we measure disk performance

There are a number of third-party tools for measuring disk and storage performance, but the one most commonly used and referenced by hardware providers is CrystalDiskMark (https://crystalmark.info/en/software/crystaldiskmark).

An example of the output from CrystalDiskMark is shown below, where the drives consist of an array of Samsung Pro 850, 6 Gb/s SATA SSDs.

Under the hood, CrystalDiskMark uses the Microsoft tool Diskspd (https://github.com/microsoft/diskspd/releases) which can be useful to run on its own to get a simple and reproducible way to measure different server and VM disk subsystem performance.

Use the following command line to test each of the relevant SQL volumes, by drive letter, in turn.

Diskspd -b64k -d120 -o32 -t4 -h -r -w25 -L -c2G G:\TestLoad.dat > GDisk_resultdetails.txt
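When several SQL volumes need testing, the same command line can be generated for each drive letter. The Python sketch below only builds the command strings, reusing the exact flags from the command above; the helper name and the drive letters G, H and I are hypothetical examples.

```python
def diskspd_command(drive_letter: str) -> str:
    """Build the Diskspd command line from this section for one volume.

    Hypothetical helper: it only substitutes the drive letter into the
    test file path and the results file name.
    """
    letter = drive_letter.rstrip(":\\").upper()
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    return (
        "Diskspd -b64k -d120 -o32 -t4 -h -r -w25 -L -c2G "
        f"{letter}:\\TestLoad.dat > {letter}Disk_resultdetails.txt"
    )


# Example drive letters only; substitute the volumes that host the SQL files.
for letter in ("G", "H", "I"):
    print(diskspd_command(letter))
```

Run the printed commands one at a time, so that the tests do not compete for the same storage subsystem.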

Once complete, review the output in the created results file GDisk_resultdetails.txt for the Total IO thread section. In the following example, the total throughput in MiB/s is measured at 931.58.

Tachyon is more akin to a data warehouse type SQL application (large sequential writes and reads) than to an OLTP type system (millions of small random IO requests). Therefore, overall throughput (MB/s) is more important than IOPS alone. Also, when comparing the IOPS in the sizing tables above (16K block size) with Diskspd results at 64K block size, multiply the Diskspd IOPS figure by 4 to equate the values.
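The factor of 4 follows from throughput being IOPS multiplied by block size: at the same throughput, a 64K test performs a quarter of the IO operations of a 16K test. A small Python sketch of the conversion, using the 931.58 MiB/s example figure from above, is shown here; the function names are illustrative.

```python
def iops_from_throughput(mib_per_s: float, block_kib: int) -> float:
    """IOPS implied by a throughput figure at a given block size."""
    return mib_per_s * 1024 / block_kib


def equivalent_iops(iops: float, from_block_kib: int, to_block_kib: int) -> float:
    """IOPS at another block size that moves the same amount of data."""
    return iops * from_block_kib / to_block_kib


# Example throughput from the Diskspd results described in this section.
iops_64k = iops_from_throughput(931.58, block_kib=64)
iops_16k = equivalent_iops(iops_64k, from_block_kib=64, to_block_kib=16)
print(f"64K IOPS: {iops_64k:.0f}, equivalent 16K IOPS: {iops_16k:.0f}")
```

This is an equal-throughput comparison only; actual storage behaviour at 16K may differ from the value derived from a 64K test.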

Table of references

The following table lists the external reference documents mentioned in the above sections.

Section	Topic	Reference
General	SQL Server Performance	https://docs.microsoft.com/en-us/sql/relational-databases/performance/performance-center-for-sql-server-database-engine-and-azure-sql-database?view=sql-server-ver15
	SSAS performance guide	http://download.microsoft.com/download/d/2/0/d20e1c5f-72ea-4505-9f26-fef9550efd44/analysis%20services%20molap%20performance%20guide%20for%20sql%20server%202012%20and%202014.docx
	Network performance (RSS)	https://docs.microsoft.com/en-us/windows-hardware/drivers/network/introduction-to-receive-side-scaling
	What is an Always On availability group?	https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server?view=sql-server-ver15
	CrystalDiskMark and Diskspd	https://crystalmark.info/en/software/crystaldiskmark and https://github.com/microsoft/diskspd/releases
Azure	What is Azure SQL?	https://docs.microsoft.com/en-us/azure/azure-sql/azure-sql-iaas-vs-paas-what-is-overview
	Best practices for SQL Server on Azure VMs	https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices
	Azure Storage for SQL VMs	https://docs.microsoft.com/en-us/azure/virtual-machines/premium-storage-performance#optimize-IOPS-throughput-and-latency-at-a-glance
	Azure Constrained Core VMs	https://docs.microsoft.com/en-us/azure/virtual-machines/constrained-vcpu?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json
	Azure Premium Storage	https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json#premium-ssd
	Windows Storage Spaces	https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices-storage
	Azure Accelerated Networking	https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-powershell
	Azure Load Balancer	https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer
AWS	AWS SQL Server Best Practices	https://aws.amazon.com/blogs/storage/maximizing-microsoft-sql-server-performance-with-amazon-ebs
	AWS Nitro Systems	https://aws.amazon.com/ec2/nitro
	AWS Storage optimized instances	https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html and https://d1.awsstatic.com/whitepapers/maximizing-microsoft-sql-using-ec2-nvme-instance-store.pdf
	AWS Constrained core Instances	https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html
	AWS EBS volume types	https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
	AWS Accelerated Networking	https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/sriov-networking.html
