infection, as discussed in the recent post on Compromise
Recovery (https://www.qubes-os.org/news/2017/04/26/qubes-compromise-recovery/), but for many cases this feature is highly desirable.
But opening up this flexibility comes at a price. We have to be careful about
which VMs can create which DispVMs. After all, it would be a disaster to allow
your casual Internet-browsing AppVM to spawn a DispVM based on your sensitive
work AppVM. The casual Internet-browsing VM could have the new DispVM open a
malicious file that compromises the DispVM. The compromised DispVM would then be
able to leak sensitive work-related data, since it uses the work VM as its
template.
There are two mechanisms in place to prevent such mistakes:
Each AppVM has a property called template_for_dispvms, which controls
whether this VM can serve as a template for Disposable VMs (i.e., whether any
DispVMs based on this VM are allowed in the system). By default,
this property is false for all AppVMs and needs to be manually enabled.
The choice of the template (i.e. the specific AppVM) for the Disposable VM must
be provided by the qrexec policy (which the source VM cannot modify), and
defaults to the source VM's default_dispvm property, which in turn defaults to
the global value specified via qubes-prefs. The resulting AppVM must have the
template_for_dispvms property set, or an error will occur.
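Both properties can be inspected from dom0 before any DispVM is started. A minimal
illustration (the personal VM is merely an example name, as found in a default
installation):
[user@dom0 ~]$ qubes-prefs --get default_dispvm
[user@dom0 ~]$ qvm-prefs --get personal default_dispvm
[user@dom0 ~]$ qvm-prefs --get personal template_for_dispvms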
Here we will take a look at how the template can be specified explicitly when
starting the DispVM from dom0:
[user@dom0 ~]$ qvm-run --dispvm=work --service qubes.StartApp+firefox
Running 'qubes.StartApp+firefox' on $dispvm:work
$dispvm:work: Refusing to create DispVM out of this AppVM, because template_for_dispvms=False
As mentioned above, we also need to explicitly enable the use of the work AppVM as
a template for Disposable VMs:
[user@dom0 ~]$ qvm-prefs --set work template_for_dispvms True
[user@dom0 ~]$ qvm-run --dispvm=work --service qubes.StartApp+firefox
Running 'qubes.StartApp+firefox' on $dispvm:work
We will look into how the DispVM's template can be specified via qrexec policy in
a later section below.
Qubes Remote Execution (qrexec): the underlying integration framework
Qubes OS is more than just a collection of isolated domains (currently
implemented as Xen VMs). The essential feature of Qubes OS, which sets
it apart from ordinary virtualization systems, is the unique way in which it
securely integrates these isolated domains for use in a single endpoint
system.
It's probably accurate to say that the great majority of our effort goes into
building this integration in such a way that it doesn't ruin the isolation that
Xen provides us.
There are several layers of integration infrastructure in Qubes OS, as depicted
in the following diagram and described below:
First, there is the Xen-provided mechanism based on shared memory for
inter-VM communication (called "grant tables" in Xen parlance). This is
then used to implement various Xen-provided drivers for networking and
virtual disks (called "block backends/frontends" by Xen). Generally, any
virtualization or containerization system provides some sort of similar
mechanism, and we could easily use any of them in Qubes, because our Core
Stack uses libvirt to configure these mechanisms. Qubes does have some
specific requirements, however, the most unique one being that we want to put
the backends in unprivileged VMs, rather than in dom0 (or the host or root
partition, however it is called in other systems). This is one of the reasons
why Xen is still our hypervisor of choice – most (all?) other systems assume
that all the backends are placed in the privileged domain, which is clearly
undesirable from the security point of view.
Next, we have the Qubes infrastructure layer, which builds on top of the
Xen-provided shared memory communication mechanism. The actual layer of
abstraction is called "vchan". If we moved to another hypervisor, we would
need to have vchan implemented on top of whatever inter-VM infrastructure
that other hypervisor made available to us. Various Qubes-specific
mechanisms, such as our security-optimized GUI virtualization, build on top
of vchan.
Finally, and most importantly, there is the Qubes-specific qrexec
infrastructure, which also builds on top of vchan, and which exposes
socket-like interfaces to processes running inside the VMs. This
infrastructure is governed by a centralized policy (enforced by the AdminVM).
Most of the Qubes-specific services and apps are built on top of qrexec, with
the notable exception of GUI virtualization, as mentioned earlier.
It’s important to understand several key properties of this qrexec
infrastructure, which distinguish it from other, seemingly similar, solutions.
First, qrexec does not attempt to perform any serialization (à la RPC). Instead,
it exposes only a simple “pipe”, pretty much like a TCP layer. This allows us to
keep the code which handles the incoming (untrusted) data very simple. Any kind
of (de-)serialization and parsing is offloaded to the specific code which has
been registered for that particular service (e.g. qubes.FileCopy).
This might seem like a superficial win. After all, the data needs to be
de-serialized and parsed somewhere, so why would it matter where exactly?
True, but what this model allows us to do is to selectively decide how much
risk we want to take for any specific domain by allowing (or not) specific
service calls from (specific or all) other domains.
For example, by decoupling the (more complex) logic of data parsing used by
the qubes.FileCopy service from the core qrexec code, we can eliminate
potential attacks on the qubes.FileCopy server code by not allowing, e.g., any
of the VMs tagged personal to issue this call to any of the VMs tagged work,
while at the same time still allowing all of these VMs to communicate with the
default clockvm (by default sys-net, adjustable via a policy redirect as
discussed below) to request the qubes.GetDate service. We would not have such
flexibility in trust partitioning if our qrexec infrastructure had
serialization built in (e.g. if it implemented a protocol like DBus between
VMs). Of course, specific services can very easily decide to use some complex
serializing protocol on top of qrexec (e.g. DBus), because a qrexec connection is
seen as a socket by applications running in the VMs.
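To make the "simple pipe" nature concrete before the larger example below, here is
a minimal, hypothetical sketch (the service name my.Uppercase and the VM names
server and client are made up). The service body simply transforms whatever arrives
on its stdin and writes the result to its stdout, and the calling side sees an
ordinary pipe:
# In the server VM, saved as /usr/local/etc/qubes-rpc/my.Uppercase and made executable:
#!/bin/sh
tr a-z A-Z
# In the client VM (after confirming the call in the trusted dialog, or adding an allow rule):
[user@client ~]$ echo "hello over qrexec" | qrexec-client-vm server my.Uppercase
HELLO OVER QREXEC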
Another important characteristic is the policing of qrexec, which is
something we discuss in the next section.
Let's write a simple qrexec service that illustrates what we've just discussed.
Imagine we want to get the price of Bitcoin (BTC), but the VM
where we need it has no network connectivity, perhaps because it contains some
very sensitive data and we want to cut off as many interfaces to the untrusted
external world as possible.
In the untrusted VM we will create our simple service with the following body:
#!/bin/sh
curl -s https://blockchain.info/q/24hrprice
The above should be pasted into the /usr/local/etc/qubes-rpc/my.GetBTCprice
file in our untrusted AppVM (let's name it… untrusted). Let's also test whether
the service works (still from the untrusted VM):
[user@untrusted ~]$ sudo chmod +x /usr/local/etc/qubes-rpc/my.GetBTCprice
[user@untrusted ~]$ /usr/local/etc/qubes-rpc/my.GetBTCprice
(...)
We should see the recent price of Bitcoin displayed.
Now, let's create a network-disconnected, trusted AppVM called wallet:
[user@dom0 ~]$ qvm-create wallet -l blue --property netvm=""
[user@dom0 ~]$ qvm-ls
And now from a console in this new AppVM, let’s try to call our newly created
service:
[user@wallet ~]$ qrexec-client-vm untrusted my.GetBTCprice
(...)
The above command will invoke the (trusted) dialog box asking the user to confirm
the request and select the destination VM. For this experiment, just select
untrusted from the "Target" drop-down list and acknowledge. (In case you wonder
why the target VM is specified twice – once in the requesting VM and a second
time in the trusted dialog box – we will discuss this in the next section.) You
should see the price of Bitcoin printed as a result.
Before we move on to discussing the flexibility of qrexec policies, let’s pause
for a moment and recap what has just happened:
The network-disconnected, trusted VM called wallet requested the service
my.GetBTCprice from a VM named untrusted. The wallet VM had no way to
get the price of BTC, because it has no networking (for security).
After the user confirmed the call (by clicking on the trusted
dialog box, or by having a specific allow policy, as discussed below), it
got passed to the destination VM: untrusted.
The qrexec agent running in the untrusted VM invoked the handling
code for the my.GetBTCprice service. This code, in turn, performed a number
of complex actions: it opened a TCP connection to some server on the internet
(blockchain.info), performed a very complex cryptographic handshake to establish
an HTTPS connection to that server, then retrieved some complex data over
that connection, and finally returned the price of Bitcoin. There are
likely hundreds of thousands of lines of code involved in this operation.
Finally, whatever the my.GetBTCprice service returned on its stdout was
automagically taken by the qrexec agent and piped back to the requesting VM, our
wallet VM.
The wallet VM got the data it wanted without needing to get involved in
all these complex operations, which require hundreds of thousands of lines of code
talking to untrusted computers over the network. That's how we can improve the
security of this process without spending effort on auditing or hardening
the programs used (e.g. curl).
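As mentioned in the recap, the confirmation prompt can be avoided with an explicit
allow rule. A minimal sketch of such a policy file in dom0, reusing the VM and
service names from this example (the exact rules are, of course, a matter of local
policy):
[user@dom0 ~]$ cat /etc/qubes-rpc/policy/my.GetBTCprice
wallet untrusted allow
$anyvm $anyvm deny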
Of course, that was a very simple example, but a very similar approach is used
across many Qubes services. E.g., system updates for dom0 are downloaded in
untrusted VMs and exposed to the (otherwise network-disconnected) dom0 via the
qubes.ReceiveUpdates service (which later verifies digital signatures on the
packages). Another example is qubes.PdfConvert, which offloads the complex
parsing and rendering of PDFs to Disposable VMs and retrieves only a very
simple format that is easily verified to be non-malicious. This simple format is
then converted back into a (now trusted) PDF.
More expressive qrexec policies
Because pretty much everything in Qubes which provides integration over the
compartmentalized domains is based on qrexec, it is imperative to have a
convenient (i.e. simple to use), secure (i.e. simple in implementation) yet
expressive enough mechanism to control who can request which qrexec services
from whom. Since the original qrexec policing was introduced in Qubes
release 1, the mechanism has undergone some slight gradual improvements.
We still keep the policy as a collection of simple text files, located in the
/etc/qubes-rpc/policy/ directory in the AdminVM (dom0). This allows for
automating the packaging of policies into (trusted) RPM packages, as well as policy
customization from within our integrated Salt Stack via (trusted) Salt state
files.
Now, one of the coolest features we've introduced in Qubes 4.0 is the ability to
tag VMs and use these tags to make policy decisions.
Imagine we have several work-related domains. We can now tag them all with some
tag of our choosing, say work:
[user@dom0 user ~]$ qvm-tags itl-email add work
[user@dom0 user ~]$ qvm-tags accounting add work
[user@dom0 user ~]$ qvm-tags project-liberation add work
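We can verify that the tags were applied with qvm-tags as well; an illustrative
check (the exact output will vary):
[user@dom0 ~]$ qvm-tags accounting list
work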
Now we can easily construct a qrexec policy, e.g. to constrain the file (or
clipboard) copy operation, so that it's allowed only between the VMs tagged
work (while preventing file transfer to and from any VM not tagged with
work) – all we need is to add the following 3 extra rules to the policy file
for qubes.FileCopy:
[user@dom0 user ~]$ cat /etc/qubes-rpc/policy/qubes.FileCopy
(...)
$tag:work $tag:work allow
$tag:work $anyvm deny
$anyvm $tag:work deny
$anyvm $anyvm ask
We can do the same for the clipboard; we just need to place the same 3 rules in
the qubes.ClipboardPaste policy file.
Also, as already discussed in the previous post on the Admin API (https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/),
the Core Stack automatically tags VMs with a created-by-XYZ tag, where XYZ is
replaced by the name of the VM which invoked the admin.vm.Create* service.
This allows us to automatically constrain the power of specific management VMs so
that each manages only its "own" VMs and no others. Please refer to the article
linked above for examples.
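Purely as an illustration (the linked post contains the actual examples), a policy
constraining one of the Admin API calls with such tags might look roughly like
this, assuming a management VM named mgmt-corpo:
[user@dom0 ~]$ cat /etc/qubes-rpc/policy/admin.vm.Start
mgmt-corpo $tag:created-by-mgmt-corpo allow
$anyvm $anyvm deny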
Furthermore, Disposable VMs can also be referred to via tags in the policy, for
example:
# Allow personal VMs to start DispVMs created by the user via guidom:
$tag:created-by-guidom $dispvm:$tag:created-by-guidom allow
# Allow corpo-managed VMs to start DispVMs based on corpo-owned AppVMs:
$tag:created-by-mgmt-corpo $dispvm:$tag:created-by-mgmt-corpo allow
Of course, we can also explicitly specify Disposable VMs using the
$dispvm: syntax, e.g. to allow AppVMs tagged
work to request a qrexec service from a Disposable VM created from the
work-printing AppVM (as noted earlier, work-printing would need to have
its template_for_dispvms property set for this to work):
$anyvm $dispvm:work-printing allow
In Qubes 4.0 we have also implemented more strict control over the destination
argument for qrexec calls. Until Qubes 4.0, the source VM (i.e., the VM that
calls a service) was responsible for providing a valid destination VM name to
which it wanted to direct the service call (e.g. qubes.FileCopy or
qubes.OpenInVM). Of course, the policy always had the last say in this
process, and if the policy had a deny rule for the specific case, the service
call was dropped.
What has changed in Qubes 4.0 is that whenever the policy says ask (in
contrast to allow or deny), then the VM-provided destination is essentially
ignored, and instead a trusted prompt is displayed in dom0 to ask the user to
select the destination (with a convenient drop down list).
(One could argue that the VM-provided destination argument in a qrexec call
is not entirely ignored, as it might be used to select a specific qrexec policy
line, in case several of them match, such as:
$anyvm work allow
$anyvm work-web ask,default_target=work-web
$anyvm work-email ask
In that case, however, the VM-provided destination would still be overridden by
the user's choice in case rule #2 was selected.)
A prime example of where this is used is the qubes.FileCopy service. However, we
should note that in a well-configured system there should be very few ask rules
for most other services. Instead, most policies should be either allow
or deny, thereby relieving the user of having to make a security decision with every
service invocation. Even the qubes.FileCopy service should be additionally guarded by
deny rules (e.g. forbidding any file transfers between personal and
work-related VMs), and we believe our integrated Salt Management Stack should be
helpful in creating such policies in larger deployments of Qubes within
corporations.
Here another powerful feature of the qrexec policy comes in handy: the target=
specifier, which can be added after the action keyword. This forces the call to
be directed to a specific destination VM, no matter what the source VM
specified. A good place to make use of this is in policies for starting various
Disposable VMs. For example, we might have a special rule for Disposable VMs
which can be invoked only from VMs tagged with work (e.g. for the
qubes.OpenInVM service):
$tag:work $dispvm allow,target=$dispvm:work-printing
$anyvm $dispvm:work-printing deny
The first line means that every DispVM created by a VM tagged with work will
be based on (i.e., use as its template) the work-printing VM. (Recall that, in
order for this to succeed, the work-printing VM also has to have its
template_for_dispvms property set to true.) The second line means that any
other VM (i.e., any VM not tagged with work) will be denied from creating
DispVMs based on the work-printing VM.
qubes.xml, qubesd, and the Admin API
Since the beginning, the Qubes Core Stack has kept all the information about the
Qubes system's (persistent) configuration in the /var/lib/qubes/qubes.xml file.
This includes information such as which VMs are defined, on which templates
they are based, how they are network-connected, etc. This file has never been
intended to be edited by the user by hand (except for rare system
troubleshooting). Instead, Qubes has provided lots of tools, such as
qvm-create, qvm-prefs, qubes-prefs, and many more, which operate on or
make use of the information in this file.
In Qubes 4.x we've introduced the qubesd daemon (service), which is now the
only entity with direct access to the qubes.xml file and which exposes a
well-defined API to other tools. This API is used by a few internal tools
running in dom0, such as some power management scripts, the qrexec policy checker,
and the qubesd-query(-fast) wrapper, which in turn is used to expose most parts of
this API to other VMs via the qrexec infrastructure. This is what we call the Admin
API, which I described in the previous post (https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/). While the
mapping between the Admin API and the internal qubesd API is nearly one-to-one,
the primary difference is that the Admin API is subject to the qrexec policy
mechanism, while the qubesd-exposed API is not policed, because it is only exposed
locally within dom0. This architecture is depicted below.
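As a quick illustration of the qrexec transport side of this (the previous post
covers the API itself), a management VM issues Admin API calls as ordinary qrexec
service requests, e.g. (assuming the policy allows it, and reusing the hypothetical
mgmt-corpo VM from the tag examples above):
[user@mgmt-corpo ~]$ qrexec-client-vm dom0 admin.vm.List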
As discussed in the post about the Admin API (https://www.qubes-os.org/news/2017/06/27/qubes-admin-api/), we have put lots of
thought into designing the API in such a way as to allow an effective split between
the user and admin roles. Security-wise, this means that admins should be able to
manage configurations and policies in the system, but not be able to access the
user data (i.e. AppVMs' private images). Likewise, it should be possible to
prevent users from changing the policies of the system, while allowing them to
use their data (but perhaps not to export or leak it easily outside the system).
For completeness, I'd like to mention that both qrexec and firewalling policies
are not included in the central qubes.xml file, but rather kept in separate
locations, i.e. in /etc/qubes-rpc/policy/ and
/var/lib/qubes/<vmname>/firewall.xml respectively. This allows for easy
updating of the policy files, e.g. from within trusted RPMs that are installed
in dom0 and which might bring new qrexec services, or from whatever tool is
used to create/manage firewalling policies.
Finally, note that qubesd (as in "qubes-daemon") should not be confused with
qubes-db (https://github.com/QubesOS/qubes-core-qubesdb) (as in
"qubes-database"). The latter is a Qubes-provided, security-optimized abstraction
for exposing static information from one VM to others (mostly from the AdminVM),
which is used, e.g., by the agents in the VMs to learn the VM's name,
type, and other configuration options.
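As a small illustration of the kind of static information QubesDB exposes, a VM can
read its own entries with the qubesdb-read tool shipped in Qubes-provided templates
(the keys below are examples):
[user@work ~]$ qubesdb-read /name
work
[user@work ~]$ qubesdb-read /qubes-vm-type
AppVM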
The new attack surface?
With all these changes to the Qubes Core Stack, an important question comes to
mind: how do they affect the security of the system?
In an attempt to provide a somewhat meaningful answer to that question, we should
first observe that there exist a number of obvious configurations of the system
(including the default one) in which there should be no security regression
compared to previous Qubes versions.
Indeed, by not allowing any other VM to access the Admin API (which is what the
default qrexec policy for the Admin API does), we essentially reduce the attack
surface on the Core Stack to what it has been in previous versions (modulo a
potentially more complex policy parser, as discussed below).
Let us now imagine exposing some subset of the Admin API to select, trusted
management VMs, such as the upcoming GUI domain (in Qubes 4.1). As long as we
consider these select VMs "trusted", again the situation does not seem to be
any worse than what it was before (we can simply think of dom0 as having
comprised these additional VMs in previous versions of Qubes; certainly there is
no security benefit here, but likewise there is no added risk).
Now let's move a step further and relax our trustworthiness requirement for this,
say, GUI domain. We will now consider it only "somewhat trustworthy". The whole
promise of the new Admin API is that, with a reasonably policed Admin API (see
also the previous post), even if this domain gets compromised, this will not
result in full system compromise, and ideally only in some kind of DoS in which
none of the user data gets compromised. Of course, in such a situation there
is additional attack surface that should be taken into account, such as the
qubesd-exposed interface. In the case of a hypothetical bug in the implementation
of the qubesd-exposed interface (which is heavily sanitized and also written in
Python, but still), an attacker who compromised our "somewhat trustworthy" GUI
domain might compromise the whole system. But then again, let's remember that
without the Admin API we would not have a "somewhat trustworthy" GUI domain in the
first place, and if we assume it was possible for the attacker to compromise
this VM, then she would also have been able to compromise dom0 in earlier Qubes
versions. That would have been fatal without any additional preconditions (such as
a bug in qubesd).
Finally, we have the case of a “largely untrusted” management VM. The typical
scenario could be a management VM "owned" by an organization/corporation. As
explained in the previous post, the Admin API should allow us to grant such a VM
authority over only a subset of VMs, specifically only those which it created,
and no others (through the convenient created-by-XYZ tags in the policy).
Now, if we consider this VM to become compromised, e.g. as a result of the
organization's proprietary management agents getting compromised somehow, then
it becomes a very pressing question how buggy the qubesd-exposed
interface might be. Again, on most (all?) other client systems, such a situation
would be immediately fatal (i.e. no additional attacks would be required after
the attacker compromised the agent), while on Qubes this would only be the
prelude to attempting further attacks to get to dom0.
One other aspect which might not be immediately clear is the trade-off between a
more flexible architecture, which, e.g., allows the creation of mutually untrusted
management VMs, on the one hand, and the increased complexity of, e.g., the policy
checker, which now also needs to understand new keywords such as the
previously introduced $dispvm:xyz or $tag:xyz, on the other. In general, we believe
that if we can introduce a significant new security improvement at the architecture
level, one which allows us to further decompose the TCB of the system, then it is
worth it. This is because architecture-level security should always come first,
before implementation-level security. Indeed, the latter can always be patched, and
in many cases won't be critical (because, e.g., a smart architecture will keep it
outside of the TCB), while the architecture very often cannot be so easily
"fixed". In fact, this is the prime reason why we have Qubes OS, i.e. because
fixing the monolithic architecture of the mainstream OSes has seemed hopeless
to us.
Summary
The new Qubes Core Stack provides a very flexible framework for managing a
compartmentalized, desktop (client) oriented system. Compared to previous Qubes
Core Stacks, it offers much more flexibility, which translates into the ability to
further decompose the system into more (largely) mutually untrusting parts.
Some readers might wonder: how does the Qubes Core Stack actually compare to
various popular cloud/server/virtualization management APIs, such as
OpenStack/EC2 or even Docker?
While at first sight there might be quite a few similarities related to the
management of VMs or containers, the primary differentiating factor is that the
Qubes Core Stack has been designed and optimized to bring the user one desktop
system built on top of multiple isolated domains (currently implemented as Xen
virtual machines, but in the future perhaps on top of something else), rather than
for the management of service-oriented infrastructure, where the services are
largely independent of each other and where the prime consideration is
scalability.
The Qubes Core Stack is Xen- and virtualization-independent, and should be
easily portable to any compartmentalization technology.
In the upcoming article we will take a look at the updated device and new volume
management in Qubes 4.0.
QSB #34: GUI issue and Xen vulnerabilities (XSA-237 through XSA-244)
https://www.qubes-os.org/news/2017/10/12/qsb-34/
Dear Qubes Community,
We have just published Qubes Security Bulletin (QSB) #34:
GUI issue and Xen vulnerabilities (XSA-237 through XSA-244).
The text of this QSB is reproduced below. This QSB and its accompanying
signatures will always be available in the Qubes Security Pack (qubes-secpack).
View QSB #34 in the qubes-secpack:
https://github.com/QubesOS/qubes-secpack/blob/master/QSBs/qsb-034-2017.txt
Learn about the qubes-secpack, including how to obtain, verify, and read it:
https://www.qubes-os.org/security/pack/
View all past QSBs:
https://www.qubes-os.org/security/bulletins/
View the XSA Tracker:
https://www.qubes-os.org/security/xsa/
---===[ Qubes Security Bulletin #34 ]===---
October 12, 2017
GUI issue and Xen vulnerabilities (XSA-237 through XSA-244)
Summary
========
One of our developers, Simon Gaiser (aka HW42), while working on
improving support for device isolation in Qubes 4.0, discovered a
potential security problem with the way Xen handles MSI-capable devices.
The Xen Security Team has classified this problem as XSA-237 [01], which
was published today.
At the same time, the Xen Security Team released several other Xen
Security Advisories (XSA-238 through XSA-244). The impact of these
advisories ranges from system crashes to potential privilege
escalations. However, the latter seem to be mostly theoretical. See our
commentary below for details.
Finally, Eric Larsson discovered a situation in which Qubes GUI
virtualization could allow a VM to produce a window that has no colored
borders (which are used in Qubes as front-line indicators of trust).
A VM cannot use this vulnerability to draw different borders in place of
the correct one, however. We discuss this issue extensively below.
Technical details
==================
Xen issues
-----------
Xen Security Advisory 237 [01]:
| Multiple issues exist with the setup of PCI MSI interrupts:
| - unprivileged guests were permitted access to devices not owned by
| them, in particular allowing them to disable MSI or MSI-X on any
| device
| - HVM guests can trigger a codepath intended only for PV guests
| - some failure paths partially tear down previously configured
| interrupts, leaving inconsistent state
| - with XSM enabled, caller and callee of a hook disagreed about the
| data structure pointed to by a type-less argument
|
| A malicious or buggy guest may cause the hypervisor to crash, resulting
| in Denial of Service (DoS) affecting the entire host. Privilege
| escalation and information leaks cannot be excluded.
Xen Security Advisory 238 [02]:
| DMOPs (which were a subgroup of HVMOPs in older releases) allow guests
| to control and drive other guests. The I/O request server page mapping
| interface uses range sets to represent I/O resources the emulation of
| which is provided by a given I/O request server. The internals of the
| range set implementation require that ranges have a starting value no
| lower than the ending one. Checks for this fact were missing.
|
| Malicious or buggy stub domain kernels or tool stacks otherwise living
| outside of Domain0 can mount a denial of service attack which, if
| successful, can affect the whole system.
|
| Only domains controlling HVM guests can exploit this vulnerability.
| (This includes domains providing hardware emulation services to HVM
| guests.)
Xen Security Advisory 239 [03]:
| Intercepted I/O operations may deal with less than a full machine
| word's worth of data. While read paths had been the subject of earlier
| XSAs (and hence have been fixed), at least one write path was found
| where the data stored into an internal structure could contain bits
| from an uninitialized hypervisor stack slot. A subsequent emulated
| read would then be able to retrieve these bits.
|
| A malicious unprivileged x86 HVM guest may be able to obtain sensitive
| information from the host or other guests.
Xen Security Advisory 240 [04]:
| x86 PV guests are permitted to set up certain forms of what is often
| called "linear page tables", where pagetables contain references to
| other pagetables at the same level or higher. Certain restrictions
| apply in order to fit into Xen's page type handling system. An
| important restriction was missed, however: Stacking multiple layers
| of page tables of the same level on top of one another is not very
| useful, and the tearing down of such an arrangement involves
| recursion. With sufficiently many layers such recursion will result
| in a stack overflow, commonly resulting in Xen to crash.
|
| A malicious or buggy PV guest may cause the hypervisor to crash,
| resulting in Denial of Service (DoS) affecting the entire host.
| Privilege escalation and information leaks cannot be excluded.
Xen Security Advisory 241 [05]:
| x86 PV guests effect TLB flushes by way of a hypercall. Xen tries to
| reduce the number of TLB flushes by delaying them as much as possible.
| When the last type reference of a page is dropped, the need for a TLB
| flush (before the page is re-used) is recorded. If a guest TLB flush
| request involves an Inter Processor Interrupt (IPI) to a CPU which
| is in the process of dropping the last type reference of some page, and
| if that IPI arrives at exactly the right instruction boundary, a stale
| time stamp may be recorded, possibly resulting in the later omission
| of the necessary TLB flush for that page.
|
| A malicious x86 PV guest may be able to access all of system memory,
| allowing for all of privilege escalation, host crashes, and
| information leaks.
Xen Security Advisory 242 [06]:
| The page type system of Xen requires cleanup when the last reference
| for a given page is being dropped. In order to exclude simultaneous
| updates to a given page by multiple parties, pages which are updated
| are locked beforehand. This locking includes temporarily increasing
| the type reference count by one. When the page is later unlocked, the
| context precludes cleanup, so the reference that is then dropped must
| not be the last one. This was not properly enforced.
|
| A malicious or buggy PV guest may cause a memory leak upon shutdown
| of the guest, ultimately perhaps resulting in Denial of Service (DoS)
| affecting the entire host.
Xen Security Advisory 243 [07]:
| The shadow pagetable code uses linear mappings to inspect and modify the
| shadow pagetables. A linear mapping which points back to itself is known as
| self-linear. For translated guests, the shadow linear mappings (being in a
| separate address space) are not intended to be self-linear. For
| non-translated guests, the shadow linear mappings (being the same
| address space) are intended to be self-linear.
|
| When constructing a monitor pagetable for Xen to run on a vcpu with, the shadow
| linear slot is filled with a self-linear mapping, and for translated guests,
| shortly thereafter replaced with a non-self-linear mapping, when the guest's
| %cr3 is shadowed.
|
| However when writeable heuristics are used, the shadow mappings are used as
| part of shadowing %cr3, causing the heuristics to be applied to Xen's
| pagetables, not the guest shadow pagetables.
|
| While investigating, it was also identified that PV auto-translate mode was
| insecure. This mode was removed in Xen 4.7 due to being unused, unmaintained
| and presumed broken. We are not aware of any guest implementation of PV
| auto-translate mode.
|
| A malicious or buggy HVM guest may cause a hypervisor crash, resulting in a
| Denial of Service (DoS) affecting the entire host, or cause hypervisor memory
| corruption. We cannot rule out a guest being able to escalate its privilege.
Xen Security Advisory 244 [08]:
| The x86-64 architecture allows interrupts to be run on distinct stacks.
| The choice of stack is encoded in a field of the corresponding
| interrupt descriptor in the Interrupt Descriptor Table (IDT). That
| field selects an entry from the active Task State Segment (TSS).
|
| Since, on AMD hardware, Xen switches to an HVM guest's TSS before
| actually entering the guest, with the Global Interrupt Flag still set,
| the selectors in the IDT entry are switched when guest context is
| loaded/unloaded.
|
| When a new CPU is brought online, its IDT is copied from CPU0's IDT,
| including those selector fields. If CPU0 happens at that moment to be
| in HVM context, wrong values for those IDT fields would be installed
| for the new CPU. If the first guest vCPU to be run on that CPU
| belongs to a PV guest, it will then have the ability to escalate its
| privilege or crash the hypervisor.
|
| A malicious or buggy x86 PV guest could escalate its privileges or
| crash the hypervisor.
|
| Avoiding to online CPUs at runtime will avoid this vulnerability.
GUI daemon issue
-----------------
Qubes OS's GUI virtualization enforces colored borders around all VM
windows. There are two types of windows. The first type is normal
windows (with borders, title bars, etc.). In this case, we modify the
window manager to take care of coloring the borders. The second type is
borderless windows (with the override_redirect property set to True in
X11 terminology). Here, the window manager is not involved at all, and
our GUI daemon needs to draw a border itself. This is done by drawing a
2px border whenever window content is changed beneath that area. The bug
was that if the VM application had never sent any updates for (any part
of) the border area, the frame was never drawn. The relevant code is in
the gui-daemon component [09], specifically in gui-daemon/xside.c [10]:
/* update given fragment of window image
 * can be requested by VM (MSG_SHMIMAGE) and Xserver (XExposeEvent)
 * parameters are not sanitized earlier - we must check it carefully
 * also do not let to cover forced colorful frame (for undecoraded windows)
 */
static void do_shm_update(Ghandles * g, struct windowdata *vm_window,
                          int untrusted_x, int untrusted_y, int untrusted_w,
                          int untrusted_h)
{
    /* ... */
    if (!vm_window->image && !(g->screen_window && g->screen_window->image))
        return;
    /* force frame to be visible: */
    /* * left */
    delta = border_width - x;
    if (delta > 0) {
        w -= delta;
        x = border_width;
        do_border = 1;
    }
    /* * right */
    delta = x + w - (vm_window->width - border_width);
    if (delta > 0) {
        w -= delta;
        do_border = 1;
    }
    /* * top */
    delta = border_width - y;
    if (delta > 0) {
        h -= delta;
        y = border_width;
        do_border = 1;
    }
    /* * bottom */
    delta = y + h - (vm_window->height - border_width);
    if (delta > 0) {
        h -= delta;
        do_border = 1;
    }
    /* ... */
}
The above code is responsible for deciding whether the colored border
needs to be updated. It is updated if both:
a) there is any window image (vm_window->image)
b) the updated area includes a border anywhere
If either of these conditions is not met, no border is drawn. Note that if
the VM tries to draw anything there (for example, a fake border in a
different color), whatever is drawn will be overridden with the correct
borders, which will stay there until the window is destroyed.
Eric Larsson discovered that this situation (not updating the border
area) is reachable -- and even happens with some real world applications
-- when the VM shows a splash screen with a custom shape. While custom
window shapes are not supported in Qubes OS, VMs do not know this. The
VM still thinks the custom-shaped window is there, so it does not send
updates of content outside of that custom shape.
We fixed the issue by forcing an update of the whole window before
making it visible:
static void handle_map(Ghandles * g, struct windowdata *vm_window)
{
    /* ... */
    /* added code */
    if (vm_window->override_redirect) {
        /* force window update to draw colorful frame, even when VM have not
         * sent any content yet */
        do_shm_update(g, vm_window, 0, 0, vm_window->width, vm_window->height);
    }
    (void) XMapWindow(g->display, vm_window->local_winid);
}
This also required some auxiliary changes in the do_shm_update
function so that the frame is drawn even when there is no window
content yet (i.e., when vm_window->image is NULL).
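For readers who want a more concrete picture of what "drawing the
frame" means at the X11 level, here is a minimal, hypothetical sketch.
It is not the actual gui-daemon code: the helper name, its parameters,
and the direct Xlib calls are our own illustration of painting a solid
border_width-pixel frame around a window.
#include <X11/Xlib.h>
/* Illustration only -- not the actual gui-daemon code. Paint a solid
 * frame, border_width pixels thick, around a window of size w x h in
 * the given color, using plain Xlib calls. */
static void draw_frame(Display *dpy, Window win, int w, int h,
                       int border_width, unsigned long color)
{
    GC gc = XCreateGC(dpy, win, 0, NULL);
    XSetForeground(dpy, gc, color);
    /* top and bottom edges */
    XFillRectangle(dpy, win, gc, 0, 0, w, border_width);
    XFillRectangle(dpy, win, gc, 0, h - border_width, w, border_width);
    /* left and right edges */
    XFillRectangle(dpy, win, gc, 0, 0, border_width, h);
    XFillRectangle(dpy, win, gc, w - border_width, 0, border_width, h);
    XFreeGC(dpy, gc);
}
The clipping logic in do_shm_update shown earlier is what decides, for
each window update, whether such a repaint of the frame region must be
forced.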
Commentary from the Qubes Security Team
========================================
For the most part, this batch of Xen Security Advisories affects Qubes
OS 3.2 only theoretically. In the case of Qubes OS 4.0, half of them do
not apply at all. We'll comment briefly on each one:
XSA-237 - The impact is believed to be denial of service only. In
          addition, we believe proper use of Interrupt Remapping should
          offer a generic solution to similar problems, reducing them
          to denial of service at worst.
XSA-238 - The stated impact is denial of service only.
XSA-239 - The attacking domain has no control over what information is
          leaked.
XSA-240 - The practical impact is believed to be denial of service
          (and it does not affect HVMs).
XSA-241 - The issue applies only to PV domains, so the attack vector
          is largely limited in Qubes OS 4.0, which uses HVM domains by
          default. In addition, the Xen Security Team considers this
          bug to be hard to exploit in practice (see the advisory).
XSA-242 - The stated impact is denial of service only. In addition,
          the issue applies only to PV domains.
XSA-243 - The practical impact is believed to be denial of service. In
          addition, the vulnerable code (shadow page tables) is
          build-time disabled in Qubes OS 4.0.
XSA-244 - The vulnerable code path (runtime CPU hotplug) is not used
          in Qubes OS.
These results reassure us that switching to HVM domains in Qubes OS 4.0
was a good decision.
Compromise Recovery
====================
Starting with Qubes 3.2, we offer Paranoid Backup Restore Mode, which
was designed specifically to aid in the recovery of a (potentially)
compromised Qubes OS system. Thus, if you believe your system might have
been compromised (perhaps because of the bugs discussed in this
bulletin), then you should read and follow the procedure described here:
https://www.qubes-os.org/news/2017/04/26/qubes-compromise-recovery/
Patching
=========
The specific packages that resolve the problems discussed in this
bulletin are as follows:
For Qubes 3.2:
- Xen packages, version 4.6.6-32
- qubes-gui-dom0, version 3.2.12
For Qubes 4.0:
- Xen packages, version 4.8.2-6
- qubes-gui-dom0, version 4.0.5
The packages are to be installed in dom0 via the Qubes VM Manager or via
the qubes-dom0-update command as follows:
For updates from the stable repository (not immediately available):
$ sudo qubes-dom0-update
For updates from the security-testing repository:
$ sudo qubes-dom0-update --enablerepo=qubes-dom0-security-testing
A system restart will be required afterwards.
These packages will migrate from the security-testing repository to the
current (stable) repository over the next two weeks after being tested
by the community.
If you use Anti Evil Maid, you will need to reseal your secret
passphrase to new PCR values, as PCR18+19 will change due to the new
Xen binaries.
Credits
========
The GUI daemon issue was discovered by Eric Larsson.
The PCI MSI issues were discovered by Simon Gaiser (aka HW42).
For other issues, see the original Xen Security Advisories.
References
===========
[01] https://xenbits.xen.org/xsa/advisory-237.html
[02] https://xenbits.xen.org/xsa/advisory-238.html
[03] https://xenbits.xen.org/xsa/advisory-239.html
[04] https://xenbits.xen.org/xsa/advisory-240.html
[05] https://xenbits.xen.org/xsa/advisory-241.html
[06] https://xenbits.xen.org/xsa/advisory-242.html
[07] https://xenbits.xen.org/xsa/advisory-243.html
[08] https://xenbits.xen.org/xsa/advisory-244.html
[09] https://github.com/QubesOS/qubes-gui-daemon/
[10] https://github.com/QubesOS/qubes-gui-daemon/blob/master/gui-daemon/xside.c#L1317-L1447
--
The Qubes Security Team
https://www.qubes-os.org/security/
RT @kylerankin: There's a reason @QubesOS marks the network VM as untrusted. Safer to treat your network that way #KRACK or not.
A Brief Introduction to the Xen Project and Virtualization from Mohsen Mostafa Jokar
https://blog.xenproject.org/2017/10/17/a-brief-introduction-to-the-xen-project-and-virtualization-from-mohsen-mostafa-jokar/
Mohsen Mostafa Jokar is a Linux administrator who works at the newspaper Hamshahri as a network and virtualization administrator. His interest in virtualization goes back to when he was at school and saw a Microsoft Virtual PC for the first time. He installed it on a PC with 256 MB of RAM and used it […]
MSI support for PCI device pass-through with stub domains
https://www.qubes-os.org/news/2017/10/18/msi-support/
Introduction
In this post, we will describe how we fixed MSI support for VMs running in HVM mode in Qubes 4.0.
First, allow us to provide some background about the MSI feature and why we need it in the first place.
In Qubes 4.0, we switched from paravirtualized (PV) virtual machines to hardware virtual machines (HVMs, also known as “fully virtualized” or “hardware-assisted” VMs) for improved security (see the 4.0-rc1 announcement (https://www.qubes-os.org/news/2017/07/31/qubes-40-rc1/#fully-virtualized-vms-for-better-isolation) for details).
For VMs running as HVMs, Xen requires emulation software, called QEMU, to provide virtual hardware (such as network cards) to the guest.
By default, Xen runs QEMU in the most trusted domain, dom0, and QEMU has quite a large attack surface.
Running QEMU in dom0 would jeopardize the security of Qubes, so it is necessary to run QEMU outside of dom0.
We do this by using a Xen feature that allows us to run QEMU inside a second “helper” VM called a “stub domain”.*
This way, an attacker who exploits a bug in QEMU will be confined to the stub domain rather than getting full access to dom0.
Admittedly, stub domains run in PV mode, which means that an attacker who were to successfully exploit QEMU would gain the ability to exploit potential Xen bugs in paravirtualization.
Nonetheless, we believe using HVMs to host PCI devices is still a considerable improvement.
Of course, in the long term, we would like to switch to using PVH VMs, but at the moment this is not feasible.
In our testing, we found that pass-through PCI devices did not work in HVMs on some machines.
On the affected machines, networking devices and USB devices, for example, were not usable as they are in Qubes 3.2.
(The kernel driver failed to initialize the device.)
This was a major problem that would have blocked us from moving entirely from PV to HVM in Qubes 4.0.
For this reason, the Qubes 4.0-rc1 installer configures all VMs that have attached PCI devices to use PV mode so that those PCI devices will function correctly.
Problems
After much further testing, we discovered that the affected PCI devices don’t work without MSI support.
(MSI is a method to trigger an interrupt from a PCI device.)
The devices we observed to be problematic were all newer Intel devices (integrated USB controllers and a Wi-Fi card).
While the PCIe standard allows for devices that don’t support legacy interrupts, all the affected devices advertised support for legacy interrupts.
But no interrupts were ever delivered after the driver configured the device.
This made the bug tricky to track down, since we were looking for an error on the software side.
To get those devices working, we needed MSI support.
When running QEMU in dom0, MSI support (and therefore the problematic devices) worked, but with stub domains, it was broken.
This is why, until now, we’ve had patches in place to hide MSI capability from the guest so that the driver doesn’t try to use it (one patch for the Mini-OS-based stub domain (https://github.com/QubesOS/qubes-vmm-xen/blob/ff5eaaa777e9d6ba42242479d1cabacfbdc728ca/patches.misc/hvmpt02-disable-msi-caps.patch) and another for the new Linux-based stub domain (https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/blob/71a01b41a9cf69d580c652a7147c0a8eb33ced97/qemu/patches/disable-msi-caps.patch)).
We found two issues that were preventing MSI support from working with stub domains.
First, the stub domain did not have the required permission on the IRQ that is reserved for the MSI by the map_pirq hypercall QEMU makes.
(The IRQ is basically a number to distinguish between interrupts from different devices.)
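To make this first problem a bit more tangible, the following fragment is a rough sketch of the kind of toolstack-side call that is involved; it is not the actual Qubes or Xen patch, the helper name and example semantics are ours, and the exact prototype of xc_domain_irq_permission differs slightly between Xen releases.
#include <stdint.h>
#include <xenctrl.h>
/* Rough illustration, not the actual patch: once a pirq has been
 * reserved for the device's MSI, the toolstack must also allow the
 * stub domain (not only the HVM that owns the device) to access that
 * IRQ, otherwise the map_pirq hypercall made by QEMU from the stub
 * domain fails for lack of permission. */
static int grant_msi_pirq_to_stubdom(xc_interface *xch,
                                     uint32_t stubdom_id, int pirq)
{
    /* stubdom_id and pirq come from the toolstack's bookkeeping for
     * the HVM and its stub domain; 1 means "allow access". */
    return xc_domain_irq_permission(xch, stubdom_id, pirq, 1);
}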