| 1. Overview: | |
| No interrupts, no devices, no I/O. | |
| Tasks are goroutines. | |
| 2. syscall: | |
| The sentry can run in both non-root mode (guest ring0) and root mode (host ring3). | |
| A user app's syscalls are intercepted as in a normal guest and handled by the sentry kernel (in non-root mode if and only if the syscall can be | |
| handled without the sentry issuing a syscall on the host). | |
| The sentry kernel's own syscalls are always executed in root (ring3) mode. A sentry kernel syscall ends up executing HLT, which causes | |
| a VM exit, after which execution continues in root mode. | |
| basic flow: | |
| bluepill loops until bluepillHandler + setcontext return; execution is now in non-root mode. | |
| user app -> syscall -> sysenter -> user: -> SwitchToUser returns -> syscall | |
| handled by the sentry kernel (t.doSyscall()), in non-root mode. | |
| sentry kernel -> syscall -> sysenter -> HLT -> VM exit -> t.doSyscall() [in root | |
| mode]. | |
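The routing rule described above can be written down as a tiny decision function. This is only a sketch of the rule as stated in these notes, not gVisor code: `Origin`, `Mode`, and `servicedIn` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Origin identifies who issued the syscall instruction (hypothetical type).
type Origin int

const (
	FromUserApp Origin = iota // guest application code
	FromSentry                // the sentry kernel itself
)

// Mode is the CPU mode in which the work ends up being done (hypothetical type).
type Mode string

const (
	NonRoot Mode = "non-root (guest ring0)"
	Root    Mode = "root (host ring3)"
)

// servicedIn models the rule from the notes: a user app syscall stays in
// non-root mode only if the sentry can handle it without a host syscall;
// the sentry's own syscalls always go through HLT -> VM exit -> root mode.
func servicedIn(o Origin, needsHostSyscall bool) Mode {
	if o == FromSentry || needsHostSyscall {
		return Root
	}
	return NonRoot
}

func main() {
	fmt.Println("app syscall, no host call:   ", servicedIn(FromUserApp, false))
	fmt.Println("app syscall, needs host call:", servicedIn(FromUserApp, true))
	fmt.Println("sentry syscall:              ", servicedIn(FromSentry, true))
}
```

The point of the sketch is that the VM exit cost is only paid on the paths that return `Root`.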
| 3. memory: | |
| physical memory: | |
| The size of physical memory is almost 1 << cpu.physicalbits, but may be smaller because of reserved regions, etc. | |
| The (vsize - psize) part is not in physicalRegion. The guest page table, i.e. gva <-> gpa, | |
| maps almost all gva <-> gpa, but gpa <-> hva (hpa) | |
| is initially set up only for the sentry kernel. After that, a gpa page frame | |
| is filled by HandleUserFault (from filemem or a HostFile) each | |
| time there is an EPT fault. | |
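The fill-on-fault behaviour can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not gVisor's implementation: `guestMemory`, `handleFault`, and `read` are hypothetical names, and the frame is zero-filled where the real HandleUserFault would take its contents from filemem or a host file.

```go
package main

import "fmt"

const pageSize = 4096

// guestMemory is a toy stand-in for the gpa -> host backing map.
type guestMemory struct {
	backing map[uint64][]byte // keyed by page-aligned gpa
	faults  int               // number of simulated EPT faults
}

func newGuestMemory() *guestMemory {
	return &guestMemory{backing: make(map[uint64][]byte)}
}

// handleFault plays the role the notes give to HandleUserFault: it installs
// a backing frame for the faulting page. Here the frame is simply zero-filled.
func (m *guestMemory) handleFault(gpa uint64) []byte {
	page := gpa &^ (pageSize - 1)
	m.faults++
	frame := make([]byte, pageSize)
	m.backing[page] = frame
	return frame
}

// read returns the byte at gpa, faulting the page in on first touch.
func (m *guestMemory) read(gpa uint64) byte {
	page := gpa &^ (pageSize - 1)
	frame, ok := m.backing[page]
	if !ok {
		frame = m.handleFault(gpa)
	}
	return frame[gpa-page]
}

func main() {
	m := newGuestMemory()
	m.read(0x1000)
	m.read(0x1008) // same page, no new fault
	m.read(0x5000) // different page, faults again
	fmt.Println("faults:", m.faults)
}
```

Only the first touch of each page pays for a fault; later accesses hit the installed mapping.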
| pagetables: | |
| gVisor itself is mapped in both root and non-root mode, with gva == hva. So the sentry runs in a userspace address space | |
| in root ring3 mode, and also runs in a userspace address space in non-root ring0 mode. | |
| user app: userspace address space (lower part of the 64-bit address space) <--> gpa | |
| kernelspace address space (higher part of the 64-bit address space), which is actually | |
| the sentry kernel's userspace address with the 63rd bit set <--> gpa. This | |
| mapping is almost useless, maybe only needed for the pagetable switch and some setup. | |
| The sentry cannot run in this address range (even | |
| PIC does not help, since PIC is resolved once, not every time an address is | |
| hit). | |
| sentry kernel: userspace address space, which is the userspace address on the host, | |
| so gva actually equals hva. Then gva <-> gpa <-> hva. | |
| kernelspace address space is hva with the 63rd bit set <--> gpa. gpa <--> hva (hpa) | |
| is set up using EPT. Again, gpa <--> hva is initially set up only for the sentry kernel. All subsequent mappings | |
| are handled by EPT faults, which eventually call HandleUserFault(). | |
| From this we can see that every user app syscall involves a pagetable switch, | |
| somewhat similar to KPTI, although the pagetables are very different. | |
| Since the user app's and the sentry kernel's pagetables probably overlap (they use the same userspace address range), they cannot be | |
| mapped at the same time. On a syscall, the CPU switches to the sentry kernel's pagetable, and there | |
| is no mapping of the user app in that table, which makes access to user memory complicated. | |
| (This is why usermem is needed.) In Linux, by contrast, the kernel's pagetable is a superset | |
| of the user process's pagetable, so the kernel can access user memory conveniently. | |
| Consider access to the user app's memory from the sentry kernel (for example, for a write syscall from the user app, the sentry kernel | |
| has to copy data out of the user app's address space). How does it find the sentry kernel's address for a given user app | |
| address? Basically, walk the user app's pagetable to get uaddr --> gpa, or walk the user app's vma to find | |
| uaddr -> file + file offset, then walk the user app's address_space to find file + file offset -> gpa. The sentry then | |
| knows gpa -> hva (it maps all the memory itself and stores the mapping) and gets the hva. In the sentry, gva == hva, so whether | |
| the sentry is in root or non-root mode, it can access this hva. | |
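The two-step lookup described above (uaddr -> gpa via the user app's pagetable, then gpa -> hva via the sentry's own records) can be sketched with two maps. This is an illustrative model only: `addressSpace`, `userPT`, `gpaToHVA`, and `sentryAddr` are hypothetical names, and real pagetable walks are multi-level rather than a single map lookup.

```go
package main

import (
	"errors"
	"fmt"
)

const pageSize = 4096

var errNotMapped = errors.New("address not mapped")

// addressSpace is a toy model of the two mappings the notes describe.
type addressSpace struct {
	userPT   map[uint64]uint64 // user app page (uaddr) -> gpa page
	gpaToHVA map[uint64]uint64 // gpa page -> hva page, kept by the sentry
}

// sentryAddr translates a user app address into the address the sentry can
// dereference. Because gva == hva inside the sentry, the resulting hva is
// usable in both root and non-root mode.
func (as *addressSpace) sentryAddr(uaddr uint64) (uint64, error) {
	off := uaddr % pageSize
	gpa, ok := as.userPT[uaddr-off]
	if !ok {
		return 0, errNotMapped
	}
	hva, ok := as.gpaToHVA[gpa]
	if !ok {
		return 0, errNotMapped
	}
	return hva + off, nil
}

func main() {
	as := &addressSpace{
		userPT:   map[uint64]uint64{0x400000: 0x10000},
		gpaToHVA: map[uint64]uint64{0x10000: 0x7f0000002000},
	}
	hva, err := as.sentryAddr(0x400123)
	fmt.Printf("hva=%#x err=%v\n", hva, err)
}
```

A copy for a write syscall would then read from the returned hva range, one page at a time, since adjacent user pages need not be adjacent in hva.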
| Filesystem: | |
| A thin VFS lives in the sentry, as in Linux. It also has limited proc and sys. The gofer is only for 9pfs. | |
| From the code path, all file operations go through the 9p server. However, from the log, there are no Tread/Twrite messages in | |
| the 9p server. Topen/Tclunk do go through the 9p server, so the assumption is | |
| that reads and writes go directly to the host file, probably via an fd passed over a unix domain socket. | |
| Network: | |
| Receive is done in a goroutine; transmit goes through endpoint.WritePacket. | |
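The shape of that split (a dedicated receive goroutine, synchronous transmit on the caller's goroutine) can be sketched as follows. This is a toy model, not gVisor's netstack: the `endpoint` type, its fields, and the `WritePacket` signature used here are hypothetical and only borrow the method name from the notes.

```go
package main

import "fmt"

type packet []byte

// endpoint is a hypothetical toy link endpoint.
type endpoint struct {
	inbound chan packet   // frames arriving from the host side
	done    chan struct{} // closed when the receive goroutine exits
	rx      []packet      // frames delivered up the stack (toy)
	tx      []packet      // frames handed to the host side (toy)
}

func newEndpoint() *endpoint {
	e := &endpoint{inbound: make(chan packet, 16), done: make(chan struct{})}
	go e.dispatchLoop() // receive path runs in its own goroutine
	return e
}

// dispatchLoop drains inbound frames until the channel is closed.
func (e *endpoint) dispatchLoop() {
	defer close(e.done)
	for p := range e.inbound {
		e.rx = append(e.rx, p)
	}
}

// WritePacket transmits on the caller's goroutine, with no queueing thread.
func (e *endpoint) WritePacket(p packet) {
	e.tx = append(e.tx, p)
}

// shutdown stops the receive goroutine and waits for it to finish.
func (e *endpoint) shutdown() {
	close(e.inbound)
	<-e.done
}

func main() {
	e := newEndpoint()
	e.inbound <- packet("ping")
	e.WritePacket(packet("pong"))
	e.shutdown()
	fmt.Println("received:", len(e.rx), "sent:", len(e.tx))
}
```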
| Summary: | |
| Shortcomings: compatibility, instability, syscall overhead. E.g., the mount command causes gVisor to exit suddenly, the ip command | |
| cannot run, and the SO_SNDBUF socket option is not supported. | |
| Merits: small memory footprint. Physical memory is backed by a memfd or a physical file (somewhat like DAX). Memory is mapped | |
| on demand, not fixed from the beginning. |
Original: https://www.cnblogs.com/dream397/p/14270544.html