TLDR

  1. Bare metal server link aggregation configuration
  2. Bare metal server link aggregation + VLAN configuration
  3. Bare metal server link aggregation + VLAN + Bridge configuration
  4. Bare metal server derived practices

Introduction

It has been a while since I last wrote a blog post.

Two reasons kept me from writing. First, OpenAI’s ChatGPT already gives excellent answers to many technical questions, so repeating them here felt redundant. Second, exhaustion left me unable to think clearly – I did not know what to share, and therefore could not find the motivation to write.

Sitting in a bookstore, away from the hustle and bustle of the Lunar New Year homecoming crowds, with coffee, music, and a sunset soon to arrive, I felt a brief moment of beauty after a long numbness, and it gave me back the motivation to write. I thought I would simply document some problems encountered at work – perhaps someone else will run into them too.

Internal Network Bandwidth Bottleneck

Our company runs services based on open-source text-to-text (and text-to-image) models. The model data ranges from 60 GB on the small end to over 100 GB on the large end. Models are stored on a separate server and served to multiple internal GPU servers via NFS. At gigabit switch speeds, reading 100 GB of model data takes 13.5 minutes. Starting all 10 GPU services would take 135 minutes (over 2 hours).
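The arithmetic above can be sketched as a quick back-of-the-envelope check (idealized line rates, ignoring NFS and TCP overhead):

```python
def transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Idealized time to move size_gb gigabytes over a link_gbps link."""
    return size_gb * 8 / link_gbps  # GB -> gigabits, divided by the line rate

one_model = transfer_seconds(100, 1)    # one 100 GB model over 1 Gbps
all_models = 10 * one_model             # ten GPU services sharing the same link
print(one_model / 60, all_models / 60)  # minutes: roughly 13 and 133
```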

To avoid slow startup caused by the NFS bottleneck, link aggregation (LACP) can be used. Link aggregation bonds two NICs into one logical interface with the combined bandwidth of both. For example, if the NFS server has two gigabit ports, link aggregation gives it 2 Gbps of internal bandwidth, cutting total startup time in half when multiple services start simultaneously. (This is just an illustrative example – the actual bare metal servers each have a single 25G port.)

Below is the specific systemd-networkd configuration, assuming two physical interfaces: eth0 and eth1.

# /etc/systemd/network/1-eth0.network
[Match]
Name=eth0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/1-eth1.network
[Match]
Name=eth1

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/5-bond0.netdev
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
MIIMonitorSec=0.1s
LACPTransmitRate=fast
UpDelaySec=0.2s
DownDelaySec=0.2s

# /etc/systemd/network/5-bond0.network
[Match]
Name=bond0

[Link]
MTUBytes=9000

[DHCPv4]
UseDNS=false

[DHCPv6]
UseDNS=false

[IPv6AcceptRA]
UseDNS=false

[Network]
BindCarrier=eth0 eth1
Gateway=172.16.160.254
Domains=jinmiaoluo.com.
DNS=223.5.5.5#dns.alidns.com
DNS=223.6.6.6#dns.alidns.com
DNSOverTLS=true

[Address]
Address=172.16.160.12/24

With this configuration, the server now has twice the aggregate internal bandwidth. (Note that LACP hashes each flow onto a single member link, so one TCP connection is still capped at one link’s speed – the gain appears when multiple clients read concurrently.)
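A few standard commands can confirm the bond came up as intended (these are general diagnostics, not output captured from the servers described here):

```shell
# Is bond0 up, and did both members join it?
networkctl status bond0

# Kernel view of the LACP negotiation: look for
# "Bonding Mode: IEEE 802.3ad Dynamic link aggregation"
# and a partner MAC address that is not 00:00:00:00:00:00.
cat /proc/net/bonding/bond0

# The reported speed should be the sum of the member links.
ethtool bond0 | grep Speed
```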

Internal Network Isolation

Expanding further. Our servers are leased directly from a bare metal provider. To mitigate internal network risks, the data center isolates different companies’ bare metal servers using VLANs. Therefore, we need to configure VLANs alongside link aggregation. Here is the configuration:

# /etc/systemd/network/1-eth0.network
[Match]
Name=eth0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/1-eth1.network
[Match]
Name=eth1

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/5-bond0.netdev
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
MIIMonitorSec=0.1s
LACPTransmitRate=fast
UpDelaySec=0.2s
DownDelaySec=0.2s

# /etc/systemd/network/5-bond0.network
[Match]
Name=bond0

[Link]
MTUBytes=9000

[Network]
VLAN=vlan500
BindCarrier=eth0 eth1

# /etc/systemd/network/10-vlan500.netdev
[NetDev]
Name=vlan500
Kind=vlan

[VLAN]
Id=500

# /etc/systemd/network/10-vlan500.network
[Match]
Name=vlan500

[Link]
MTUBytes=9000

[DHCPv4]
UseDNS=false

[DHCPv6]
UseDNS=false

[IPv6AcceptRA]
UseDNS=false

[Network]
Gateway=172.16.160.254
Domains=jinmiaoluo.com.
DNS=223.5.5.5#dns.alidns.com
DNS=223.6.6.6#dns.alidns.com
DNSOverTLS=true

[Address]
Address=172.16.160.12/24

With this configuration, the server can now communicate with other servers on the internal network.
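The tagged interface can be checked with the kernel’s detailed link view (illustrative commands, using the addresses from the configuration above):

```shell
# The -d (details) flag prints "vlan protocol 802.1Q id 500" when the
# VLAN interface is set up correctly on top of bond0.
ip -d link show vlan500

# The address and default route should now live on the VLAN interface.
ip addr show vlan500
ping -c 3 172.16.160.254
```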

Virtualization Bridge

Expanding further. After configuring link aggregation and VLAN, to avoid wasting bare metal server resources, we deploy virtualization (see this article). To allow virtual machines to obtain IPs from the same LAN subnet as the bare metal server, we need to configure a bridge. The configuration is as follows:

# /etc/systemd/network/1-eth0.network
[Match]
Name=eth0

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/1-eth1.network
[Match]
Name=eth1

[Link]
MTUBytes=9000

[Network]
Bond=bond0

# /etc/systemd/network/5-bond0.netdev
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
MIIMonitorSec=0.1s
LACPTransmitRate=fast
UpDelaySec=0.2s
DownDelaySec=0.2s

# /etc/systemd/network/5-bond0.network
[Match]
Name=bond0

[Link]
MTUBytes=9000

[Network]
VLAN=vlan500
BindCarrier=eth0 eth1

# /etc/systemd/network/10-vlan500.netdev
[NetDev]
Name=vlan500
Kind=vlan

[VLAN]
Id=500

# /etc/systemd/network/10-vlan500.network
[Match]
Name=vlan500

[Network]
Bridge=virbr0

# /etc/systemd/network/15-virbr0.netdev
[NetDev]
Name=virbr0
Kind=bridge

[Bridge]
STP=yes

# /etc/systemd/network/15-virbr0.network
[Match]
Name=virbr0

[Network]
Gateway=172.16.160.254
Domains=jinmiaoluo.com.
DNS=223.5.5.5#dns.alidns.com
DNS=223.6.6.6#dns.alidns.com
DNSOverTLS=true

[Address]
Address=172.16.160.12/24

With this configuration, we can create multiple VMs with internal network IPs on top of link aggregation and VLAN (simply select bridge mode for the VM network).
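As a hypothetical example of “bridge mode”, a libvirt guest can be attached to virbr0 at creation time (the VM name, sizes, and installer path below are placeholders):

```shell
virt-install \
  --name demo-vm \
  --memory 8192 \
  --vcpus 4 \
  --disk size=40 \
  --network bridge=virbr0,model=virtio \
  --cdrom /path/to/installer.iso
```

The guest then sits on the same L2 segment as the bare metal host and can take an address from the 172.16.160.0/24 subnet.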

Becoming a Cloud Server

Once virtualization is ready, turning a VM into a “cloud server” (a VM with a public IP) only requires configuring the appropriate mapping on the NAT device.

The NAT device maps all traffic for a specific public IP to an internal IP. This is exactly why we need to assign internal IPs (from the bare metal server’s LAN subnet) to VMs via a bridge.
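On a Linux NAT device, this mapping can be expressed with nftables. A minimal sketch, assuming the public IP 203.0.113.10 (a documentation address, not from the article) maps to a VM at 172.16.160.50:

```shell
# 1:1 NAT: all inbound traffic for the public IP is rewritten to the VM,
# and the VM's outbound traffic is rewritten back to the public IP.
nft add table ip nat
nft add chain ip nat prerouting '{ type nat hook prerouting priority -100; }'
nft add chain ip nat postrouting '{ type nat hook postrouting priority 100; }'
nft add rule ip nat prerouting ip daddr 203.0.113.10 dnat to 172.16.160.50
nft add rule ip nat postrouting ip saddr 172.16.160.50 snat to 203.0.113.10
```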

Why Virtualization

Bare metal servers are many times more powerful than typical cloud servers. For example, our test server:

CPU: Intel Xeon Platinum 8352V * 2 (72 physical cores, 144 threads)
MEM: 512 GB
GPU: RTX 4090 * 8
NET: 25G * 2 (LACP)

Using it as a single server would waste resources. Virtualization serves two purposes: first, it improves resource utilization (most services are idle most of the time); second, it isolates resources, preventing a single process from monopolizing the machine (e.g., GPUs can be split into separate VMs via GPU passthrough).

Derived Practices

Combined with previous posts, the following scenarios can be addressed:

  1. Private deployment of open-source large language models. Since the physical server has many GPUs, private deployment can improve local development efficiency. For example:

    1. Use open-webui for a web-based UI
    2. Use continue (a VS Code extension) for code analysis
    3. Use Immersive Translate (a browser extension) for efficient translation
  2. Migrating off public cloud. After leasing physical servers, virtualization and NAT enable you to create VMs with public IPs (similar to cloud servers).

  3. Cross-city office networking. Virtualization provides the runtime environment for common services (e.g., GitLab, K8s), while WireGuard connects the LANs. Common scenarios include:

    1. Cross-city frontend-backend debugging. Build a VPN network with WireGuard.
    2. Developers accessing K8s SVC IPs. LAN interconnection via Linux packet forwarding + static routes.
    3. Internal service deployment isolated from the public internet. Deploy services in VMs and use firewall rules so that only VPN-connected (i.e., internal) users can access services like GitLab and Ollama Server.
  4. Transparent proxy service. Development environments inevitably need to access overseas sites, and slow access hurts productivity.

    Common problems:

    • Slow large language model downloads
    • Slow Docker image pulls
    • Slow npm/pip/Go dependency downloads

    Solution:

    • Create a VM via virtualization
    • Deploy xray (using transparent proxy mode) as a sidecar gateway on the internal network (assume this VM’s internal IP is x.x.x.x)
    • Configure other VMs and bare metal servers to use x.x.x.x as their gateway via static IP, enabling accelerated overseas access
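The last step above is just the default gateway in systemd-networkd. A sketch, with the interface name and address as placeholders and x.x.x.x kept as in the text:

```
# /etc/systemd/network/20-lan.network (hypothetical)
[Match]
Name=enp1s0

[Network]
Gateway=x.x.x.x

[Address]
Address=172.16.160.30/24
```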

Summary

This article covers bare metal server network configuration using systemd-networkd, along with bridge configuration needed for virtualization deployment. Beyond the derived practices mentioned above, you may discover even more interesting use cases (e.g., NatMap, which lets you directly connect to any device on your home LAN at peak home bandwidth from a remote location – that one is left for the curious to explore).

Written on the second day of the Lunar New Year, at 1200bookshop in Guangzhou. The sunset is beautiful. Happy New Year.