(Switching to English because searching for the error code yields no results. International readers will likely find this thread.)
I’m stuck with my investigation. Maybe someone else can help.
What happened so far
Once every few weeks, my game freezes the system right after the splash screen completes. This happens on Windows 10 1909 x64 as well as on Windows 7 x86. The screen freezes and I cannot move the cursor. A few seconds later, the system comes back to life and acts as if nothing happened.
Windows Error Reporting writes an Event Log entry like the following:
Fault bucket , type 0
Event Name: LiveKernelEvent
Response: Not available
Cab Id: 0
Problem signature:
P1: 190
P2: 1
P3: ███████████████
P4: ███████████████
P5: ███████████████
P6: 10_0_18363
P7: 0_0
P8: 768_1
P9:
P10:
Attached files:
\\?\C:\WINDOWS\LiveKernelReports\win32k.sys\win32k.sys-████████-████.dmp
\\?\C:\WINDOWS\TEMP\WER-██████████-0.sysdata.xml
\\?\C:\WINDOWS\LiveKernelReports\win32k.sys-████████-████.dmp
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER████.tmp.WERInternalMetadata.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER████.tmp.xml
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER████.tmp.csv
\\?\C:\ProgramData\Microsoft\Windows\WER\Temp\WER████.tmp.txt
These files may be available here:
\\?\C:\ProgramData\Microsoft\Windows\WER\ReportQueue\Kernel_██_█████████████████████████████████████████████████████████████████████████████████████
Analysis symbol:
Rechecking for solution: 0
Report Id: ████████-████-████-████-████████████
Report Status: 4
Hashed bucket:
Cab Guid: 0
I saved both crash dumps I could find (one 2.5 GiB, one 3.0). I won’t upload them because they contain personal data like encryption keys.
I opened the first dump in WinDbg and typed !analyze -v. The result is:
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
WIN32K_CRITICAL_FAILURE_LIVEDUMP (190)
Win32k has encountered a critical failure.
Live dump is captured to collect the debug information.
Arguments:
Arg1: 0000000000000001, REGION_VALIDATION_FAILURE
Region is out of surface bounds.
Arg2: fffff9564061b010, Pointer to DC
Arg3: fffff9564b2e5b00, Pointer to SURFACE
Arg4: fffff95646d48130, Pointer to REGION
Debugging Details:
------------------
KEY_VALUES_STRING: 1
Key : Analysis.CPU.Sec
Value: 9
Key : Analysis.DebugAnalysisProvider.CPP
Value: Create: 8007007e on ████████
Key : Analysis.DebugData
Value: CreateObject
Key : Analysis.DebugModel
Value: CreateObject
Key : Analysis.Elapsed.Sec
Value: 29
Key : Analysis.Memory.CommitPeak.Mb
Value: 75
Key : Analysis.System
Value: CreateObject
DUMP_FILE_ATTRIBUTES: 0x10
Live Generated Dump
BUGCHECK_CODE: 190
BUGCHECK_P1: 1
BUGCHECK_P2: fffff9564061b010
BUGCHECK_P3: fffff9564b2e5b00
BUGCHECK_P4: fffff95646d48130
PROCESS_NAME: TFXplorer x64.exe
STACK_TEXT:
ffff820d`29faa1b0 fffff807`5939c167 : ffffffff`ffffffff 00000000`00000211 00000000`00000211 00000000`00000008 : nt!IopLiveDumpEndMirroringCallback+0x7e
ffff820d`29faa200 fffff807`593a8405 : 00000000`00000000 fffff807`00000000 ffffa802`00000001 00000000`00000001 : nt!MmDuplicateMemory+0x3cb
ffff820d`29faa2a0 fffff807`5965923b : ffffa802`e47e9c30 ffffa802`e47e9c30 ffff820d`29faa578 ffff820d`29faa578 : nt!IopLiveDumpCaptureMemoryPages+0x79
ffff820d`29faa360 fffff807`5964c29f : 00000000`00000000 ffff930e`29ff0450 ffff930e`3677adb0 ffff930e`29ff0450 : nt!IoCaptureLiveDump+0x2e7
ffff820d`29faa510 fffff807`5964c9c8 : ffffffff`80002080 00000000`00000000 00000000`00000000 00000000`00000190 : nt!DbgkpWerCaptureLiveFullDump+0x137
ffff820d`29faa570 fffff807`5964c0f1 : 00000000`00000002 00000000`00000000 fffff914`10aab830 fffff914`00000000 : nt!DbgkpWerProcessPolicyResult+0x30
ffff820d`29faa5a0 fffff914`1098be9d : 00000000`00000315 fffff956`46d48130 fffff956`4061b010 00000000`00000000 : nt!DbgkWerCaptureLiveKernelDump+0x1a1
ffff820d`29faa5f0 fffff914`108d72bb : 00000000`00000000 ffff820d`29faa6e0 00000000`00000000 ffff820d`29faa730 : win32kbase!GrepValidateVisRgn+0xb150d
ffff820d`29faa690 fffff914`108d7070 : ffff820d`29faa850 fffff956`4061b010 00000000`00040064 ffffa802`de7f91f0 : win32kbase!GreSelectVisRgnInternal+0x9f
ffff820d`29faa710 fffff914`0fcecc1f : 00000000`00040064 ffffffff`94052614 00000000`00040064 ffffffe6`ffffffa8 : win32kbase!GreSelectVisRgn+0x40
ffff820d`29faa750 fffff914`0fcd66ee : 00000000`00040064 fffff956`467d2d20 00000000`00040062 00000000`00000000 : win32kfull!UpdateSpriteArea+0x13f
ffff820d`29faa8b0 fffff914`0fcd7f15 : fffff956`474ea0d0 00000000`00000001 fffff956`478eb460 fffff956`477f3550 : win32kfull!zzzBltValidBits+0x7a2
ffff820d`29faa9d0 fffff914`0fc2c877 : 00000000`00000001 00000000`00000000 00000000`00000001 00000000`00000001 : win32kfull!xxxEndDeferWindowPosEx+0x445
ffff820d`29faaab0 fffff807`58fd4558 : ffffa802`e55ba080 00000000`00000742 00000000`00000742 ffffa802`e6af5820 : win32kfull!NtUserEndDeferWindowPosEx+0x97
ffff820d`29faab00 00007ff8`05eb1564 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x28
00000000`0057bcb8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ff8`05eb1564
SYMBOL_NAME: win32kbase!GrepValidateVisRgn+b150d
MODULE_NAME: win32kbase
IMAGE_NAME: win32kbase.sys
STACK_COMMAND: .thread ; .cxr ; kb
BUCKET_ID_FUNC_OFFSET: b150d
FAILURE_BUCKET_ID: LKD_0x190_win32kbase!GrepValidateVisRgn
OS_VERSION: 10.0.18362.1
BUILDLAB_STR: 19h1_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {████████-████-████-████-████████████}
Followup: MachineOwner
---------
This does give us the most important information:
- PROCESS_NAME: TFXplorer x64.exe
That’s my game!
- Win32k has encountered a critical failure. Win32k is the graphics subsystem of the kernel. It renders windows and menus, delegates work to the graphics driver, etc. We hit a critical failure there.
- Call Stack:
nt!DbgkWerCaptureLiveKernelDump+0x1a1
win32kbase!GrepValidateVisRgn+0xb150d
win32kbase!GreSelectVisRgnInternal+0x9f
win32kbase!GreSelectVisRgn+0x40
win32kfull!UpdateSpriteArea+0x13f
win32kfull!zzzBltValidBits+0x7a2
win32kfull!xxxEndDeferWindowPosEx+0x445
xxxEndDeferWindowPosEx() is the kernel-mode counterpart of EndDeferWindowPos(), which moves a number of windows to their new positions in one step. As we can see, this call has identified the visible region of a window and tries to copy its pixels to the new position (so that the window doesn’t have to be redrawn entirely).
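Pulling the interesting frames out of STACK_TEXT by hand gets tedious with more than one dump. Here is a minimal sketch of how that could be scripted, assuming the line format shown above. The regular expression and the helper names parse_frames and skip_dump_machinery are made up for this example and are not part of WinDbg.

```python
import re

# One frame line of the STACK_TEXT above: child SP, return address,
# four raw arguments, then "module!function+0xoffset".
FRAME_RE = re.compile(
    r"^[0-9a-f`]+\s+[0-9a-f`]+\s*:\s*"      # child SP and return address
    r"(?:[0-9a-f`]+\s+){4}:\s*"             # four argument columns
    r"(\w+)!(\w+)\+0x([0-9a-f]+)$"          # module, function, offset
)


def parse_frames(stack_text):
    """Return (module, function, offset) for every line with a resolved symbol."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            module, function, offset = match.groups()
            frames.append((module, function, int(offset, 16)))
    return frames


def skip_dump_machinery(frames, module="nt"):
    """Drop the leading frames that belong to the live-dump capture itself."""
    index = 0
    while index < len(frames) and frames[index][0] == module:
        index += 1
    return frames[index:]


# Four lines copied from the STACK_TEXT above; the last one has no symbol.
SAMPLE = """\
ffff820d`29faa5a0 fffff914`1098be9d : 00000000`00000315 fffff956`46d48130 fffff956`4061b010 00000000`00000000 : nt!DbgkWerCaptureLiveKernelDump+0x1a1
ffff820d`29faa5f0 fffff914`108d72bb : 00000000`00000000 ffff820d`29faa6e0 00000000`00000000 ffff820d`29faa730 : win32kbase!GrepValidateVisRgn+0xb150d
ffff820d`29faaab0 fffff807`58fd4558 : ffffa802`e55ba080 00000000`00000742 00000000`00000742 ffffa802`e6af5820 : win32kfull!NtUserEndDeferWindowPosEx+0x97
00000000`0057bcb8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ff8`05eb1564
"""

for module, function, offset in skip_dump_machinery(parse_frames(SAMPLE)):
    print(f"{module}!{function}+{offset:#x}")
```

Lines without a resolved symbol (like the user-mode return address at the bottom) are simply skipped, so the same script can be run over both dumps to compare the stacks.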
- Bugcheck:
Arg1: 0000000000000001, REGION_VALIDATION_FAILURE
Region is out of surface bounds.
Arg2: fffff9564061b010, Pointer to DC
Arg3: fffff9564b2e5b00, Pointer to SURFACE
Arg4: fffff95646d48130, Pointer to REGION
Before copying pixels, the kernel does some sanity checking – are the pixels there in the first place? Here, they are not. The visible region determined by zzzBltValidBits() lies outside the bounds of the surface, and rather than copying garbage, the kernel dumps core.
In short, I made the kernel copy pixels which aren’t there. How could this happen?
The obvious first step is reducing my program to a minimal reproducible sample. But this isn’t possible here: I collected the first crash dump on 2020-10-24. Then I ran the program about a thousand times(!) and nothing happened. The second dump was written a week later, on 2020-10-31.
I tried to find out what exactly was causing the error, but the dump does not include user-mode information, as can be seen from the message WinDbg prints when it opens the dump: “Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.”
I wanted to make DbgkWerCaptureLiveKernelDump() collect user-mode information, but instead I found the following culprit, described in an article:
Clearly the code is trying to avoid generating too many Live Kernel Reports and thus has put some arbitrary deadline in the future for the next time a dump may be written. I searched but could not find a way to bypass this (i.e. a, “no, please do wear out my SSD and fill my drive, I don’t care!” flag) but alas did not find one. This dashed any and all hopes I had for using this as a testing mechanism in the lab.
The sample in the article uses a time span of four days. So I could be hitting this bug all the time but only once every few weeks the kernel would actually dump core. So much for reducing the problem.
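To illustrate why such a cooldown hides most occurrences, here is a toy model. It is not the kernel’s actual policy: the function name, the once-a-day failure rate and the use of the article’s four-day span as a fixed cooldown are all assumptions for illustration.

```python
def dumps_written(failure_days, cooldown_days):
    """Return the failures that would produce a dump under a simple cooldown.

    Toy model only: a dump is written when the previous dump is at least
    `cooldown_days` old; every other failure leaves no trace.
    """
    written = []
    next_allowed = None
    for day in sorted(failure_days):
        if next_allowed is None or day >= next_allowed:
            written.append(day)
            next_allowed = day + cooldown_days
    return written


# Hypothetical scenario: the failure happens once a day for ten days.
failures = list(range(10))
written = dumps_written(failures, cooldown_days=4)
print(len(failures), "failures,", len(written), "dumps on days", written)
```

In this model, most failures would go unrecorded, so the number of dumps on disk says little about how often the bug is actually hit.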
Instead, let’s have a look at the invalid region/surface and try to recognize them in my game.
Here’s a recording of my game without it hanging:
(enable non-https content in your browser if you can’t see anything)
First comes the splash screen, which checks whether Direct3D and XAudio are available, etc. The splash screen is then scrolled out to the left and the main page is scrolled in from the right. The bugcheck occurs when the page transition starts; i.e. the splash screen remains frozen. When the system comes back a few seconds later, the transition is already over.
Finding the surface
Let’s find out which surface is garbage.
Arg3: fffff9564b2e5b00, Pointer to SURFACE
Unfortunately, there is no type for this address. I tried a number of explicit types (via dt) but they didn’t work either. Looks like the surface structure is just not included in the Win32k debug symbols.
I googled quite a bit and eventually found definitions here:
https://blog.quarkslab.com/reverse-engi ... ation.html
typedef struct _BASEOBJECT {
HANDLE hHmgr;
ULONG ulShareCount;
USHORT cExclusiveLock;
USHORT BaseFlags;
PVOID Tid;
} BASEOBJECT, *PBASEOBJECT;
typedef struct _SURFOBJ {
DHSURF dhsurf;
HSURF hsurf;
DHPDEV dhpdev;
HDEV hdev;
SIZEL sizlBitmap;
ULONG cjBits;
PVOID pvBits;
PVOID pvScan0;
LONG lDelta;
ULONG iUniq;
ULONG iBitmapFormat;
USHORT iType;
USHORT fjBitmap;
} SURFOBJ, *PSURFOBJ;
These match quite well:
fffff956`4b2e5b00 14 26 05 94 ff ff ff ff 03 00 00 00 00 00 00 40 .&.............@ _BASEOBJECT.handle 14 26 05 94 ff ff ff ff
_BASEOBJECT.ulShareCount 03 00 00 00
_BASEOBJECT.cExclusiveLock 00 00
_BASEOBJECT.BaseFlags 00 40
fffff956`4b2e5b10 80 a0 5b e5 02 a8 ff ff 90 da 3b 47 56 f9 ff ff ..[.......;GV... _BASEOBJECT.Tid 80 a0 5b e5 02 a8 ff ff
_SURFOBJ.dhsurf 90 da 3b 47 56 f9 ff ff
fffff956`4b2e5b20 14 26 05 94 ff ff ff ff 20 e0 31 47 56 f9 ff ff .&...... .1GV... _SURFOBJ.hsurf 14 26 05 94 ff ff ff ff
_SURFOBJ.dhpdev 20 e0 31 47 56 f9 ff ff
fffff956`4b2e5b30 00 10 91 40 56 f9 ff ff 52 07 00 00 15 03 00 00 ...@V...R....... _SURFOBJ.hdev 00 10 91 40 56 f9 ff ff
_SURFOBJ.sizlBitmap.cx == 1874
_SURFOBJ.sizlBitmap.cy == 789; 1874×789 ~ size of main window (a little larger even)
fffff956`4b2e5b40 e8 3e 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 .>Z............. _SURFOBJ.cjBits e8 3e 5a 00 == 4 * width * height (32-bit color depth)
_SURFOBJ.pvBits nullptr
_SURFOBJ.pvScan0 nullptr
fffff956`4b2e5b50 00 00 00 00 00 00 00 00 00 00 00 00 be b7 27 01 ..............'. _SURFOBJ.lDelta 00 00 00 00
_SURFOBJ.iUniq 00 00 00 00
_SURFOBJ.iBitmapFormat 00 00 00 00
_SURFOBJ.iType be b7
_SURFOBJ.fjBitmap 27 01
The symbols do seem to fall apart at iType. I’d be glad if someone could provide me with the actual symbols …
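One possible explanation for the mismatch, offered as an assumption rather than a confirmed layout: on x64, 8-byte pointers are aligned to 8 bytes, so a compiler would put 4 bytes of padding between cjBits and pvBits. The sketch below computes the offsets under natural alignment and decodes the dumped bytes accordingly. The field list is taken from the definitions quoted above; the helper names layout and decode are made up for this example.

```python
# (field name, size in bytes) for a 64-bit target, in declaration order.
FIELDS = [
    ("BASEOBJECT.hHmgr", 8), ("BASEOBJECT.ulShareCount", 4),
    ("BASEOBJECT.cExclusiveLock", 2), ("BASEOBJECT.BaseFlags", 2),
    ("BASEOBJECT.Tid", 8),
    ("SURFOBJ.dhsurf", 8), ("SURFOBJ.hsurf", 8), ("SURFOBJ.dhpdev", 8),
    ("SURFOBJ.hdev", 8),
    ("SURFOBJ.sizlBitmap.cx", 4), ("SURFOBJ.sizlBitmap.cy", 4),
    ("SURFOBJ.cjBits", 4),
    ("SURFOBJ.pvBits", 8), ("SURFOBJ.pvScan0", 8),
    ("SURFOBJ.lDelta", 4), ("SURFOBJ.iUniq", 4),
    ("SURFOBJ.iBitmapFormat", 4), ("SURFOBJ.iType", 2), ("SURFOBJ.fjBitmap", 2),
]


def layout(fields):
    """Assign offsets with natural alignment (each scalar aligned to its size)."""
    offsets, pos = {}, 0
    for name, size in fields:
        pos = (pos + size - 1) // size * size   # round up to a multiple of size
        offsets[name] = (pos, size)
        pos += size
    return offsets


# The 96 bytes at fffff956`4b2e5b00, copied from the dump above.
DUMP = bytes.fromhex(
    "14 26 05 94 ff ff ff ff 03 00 00 00 00 00 00 40 "
    "80 a0 5b e5 02 a8 ff ff 90 da 3b 47 56 f9 ff ff "
    "14 26 05 94 ff ff ff ff 20 e0 31 47 56 f9 ff ff "
    "00 10 91 40 56 f9 ff ff 52 07 00 00 15 03 00 00 "
    "e8 3e 5a 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 be b7 27 01 "
)


def decode(dump, offsets):
    """Read every field that lies completely inside the dumped bytes.

    Values are read as unsigned little-endian integers (lDelta is really
    signed, which makes no difference for the zero seen here).
    """
    values = {}
    for name, (offset, size) in offsets.items():
        if offset + size <= len(dump):
            values[name] = int.from_bytes(dump[offset:offset + size], "little")
    return values


offsets = layout(FIELDS)
values = decode(DUMP, offsets)
for name, value in values.items():
    print(f"+{offsets[name][0]:#04x}  {name:28} {value:#x}")
print("cjBits == 4 * cx * cy:",
      values["SURFOBJ.cjBits"]
      == 4 * values["SURFOBJ.sizlBitmap.cx"] * values["SURFOBJ.sizlBitmap.cy"])
```

Under this assumption pvBits and pvScan0 are still null and cjBits still equals 4 × 1874 × 789, but be b7 27 01 would be iUniq, and the 96 dumped bytes end just before iBitmapFormat.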
But anyway, we see a bitmap of 1874×789 pixels and 32-bit color depth. That matches the main window pretty well. (A few extra pixels for Aero’s shadow effects, maybe?)
We also see that _SURFOBJ.pvBits and _SURFOBJ.pvScan0 are both zero. That’s not good, I guess, because MSDN says one of them should always be set.
I interpret the data as: “The kernel tries to copy pixels from the main window, but the main window doesn’t have a bitmap allocated”. Which is very strange, because we do see that window on-screen. But we know we are in a strange place, because the kernel wouldn’t panic otherwise.
Finding the region
I had even less luck looking for a definition of REGION. Instead, I just interpreted the stuff as 4-byte integers:
fffff956`46d48130 64 00 04 00 00 00 00 00 00 00 00 00 01 00 00 80 d...............
fffff956`46d48140 80 a0 5b e5 02 a8 ff ff d8 00 00 00 00 00 00 00 ..[.............
fffff956`46d48150 00 00 00 00 00 00 00 00 08 82 d4 46 56 f9 ff ff ...........FV...
fffff956`46d48160 60 81 d4 46 56 f9 ff ff 60 81 d4 46 56 f9 ff ff `..FV...`..FV...
fffff956`46d48170 30 81 d4 46 56 f9 ff ff e0 4c 9e 10 14 f9 ff ff 0..FV....L......
fffff956`46d48180 d8 00 00 00 05 00 00 00 ea ff ff ff 4f 00 00 00 ............O... | 216 | 5 | -22 | 79 | << d8 = 216; 4f 00 00 00 == 0x0000004f == 79
fffff956`46d48190 2c 07 00 00 0d 03 00 00 00 00 00 00 00 00 00 80 ,............... | 1836 | 781 | 0 |INTMIN|
fffff956`46d481a0 4f 00 00 00 00 00 00 00 02 00 00 00 4f 00 00 00 O...........O... | 79 | 0 | 2 | 79 |
fffff956`46d481b0 fe 01 00 00 ea ff ff ff 2c 07 00 00 02 00 00 00 ........,....... | 510 | -22 | 1836 | 2 |
fffff956`46d481c0 04 00 00 00 fe 01 00 00 46 02 00 00 ea ff ff ff ........F....... | 4 | 510 | 582 | -22 | << fe 01 == 0x01fe == 510; 46 02 == 0x0246 == 582; ea ff ff ff == 0xffffffea == -22;
fffff956`46d481d0 db 02 00 00 1b 05 00 00 2c 07 00 00 04 00 00 00 ........,....... | 731 | 1307| 1836 | 4 | << 2c 07 == 0x72c == 1836
fffff956`46d481e0 02 00 00 00 46 02 00 00 0d 03 00 00 ea ff ff ff ....F........... | 2 | 582 | 781 | -22 |
fffff956`46d481f0 2c 07 00 00 02 00 00 00 00 00 00 00 0d 03 00 00 ,............... | 1836 | 2 | 0 | 781 |
fffff956`46d48200 ff ff ff 7f 00 00 00 00 ff ff ff 7f 00 00 00 00 ................ |INTMAX| 0 |INTMAX| 0 |
fffff956`46d48210 00 00 0f 23 47 6c 61 34 00 00 00 00 00 00 00 00 ...#Gla4........
#Gla at the end marks the beginning of a new kernel object (global lookaside table). The numbers 1836 and 781 appear quite often. 781 is exactly the height of my main window (no shadows or anything). 1836 is smaller by 24 pixels, and the other common number is -22.
If you draw it over my screenshot, you can see that the coordinates (-22, 79) and (1836, 781) correspond exactly to the old page after scrolling out a few pixels to the left.
FWIW, the coordinates in both crash dumps are almost identical. This is suspicious – the amount of scrolling in the animation depends on the timing of the next monitor refresh.
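The numbers can also be read in a more structured way. The sketch below assumes that the tail of REGION is a size, a scan count, a bounding rectangle, and then a list of scans, each consisting of a wall count, top and bottom y, the x coordinates of the walls, and the wall count again. This layout is an assumption based on public reverse-engineering of Win32k, not on official symbols, and all names in the code are made-up labels. The containment test at the end is equally naive: it assumes that region and surface share one coordinate space, which the dump does not confirm.

```python
import struct

# The 144 bytes from fffff956`46d48180 up to the next pool header, copied from above.
REGION_TAIL = bytes.fromhex(
    "d8 00 00 00 05 00 00 00 ea ff ff ff 4f 00 00 00 "
    "2c 07 00 00 0d 03 00 00 00 00 00 00 00 00 00 80 "
    "4f 00 00 00 00 00 00 00 02 00 00 00 4f 00 00 00 "
    "fe 01 00 00 ea ff ff ff 2c 07 00 00 02 00 00 00 "
    "04 00 00 00 fe 01 00 00 46 02 00 00 ea ff ff ff "
    "db 02 00 00 1b 05 00 00 2c 07 00 00 04 00 00 00 "
    "02 00 00 00 46 02 00 00 0d 03 00 00 ea ff ff ff "
    "2c 07 00 00 02 00 00 00 00 00 00 00 0d 03 00 00 "
    "ff ff ff 7f 00 00 00 00 ff ff ff 7f 00 00 00 00 "
)


def decode_region(tail):
    """Decode size, bounding box and scans under the assumed layout."""
    words = struct.unpack("<%di" % (len(tail) // 4), tail)
    size_rgn, scan_count = words[0], words[1]
    bounds = words[2:6]                      # left, top, right, bottom
    scans, i = [], 6
    for _ in range(scan_count):
        wall_count, y_top, y_bottom = words[i:i + 3]
        walls = list(words[i + 3:i + 3 + wall_count])
        if words[i + 3 + wall_count] != wall_count:   # trailing copy of the count
            raise ValueError("assumed scan layout does not fit")
        scans.append((y_top, y_bottom, walls))
        i += 4 + wall_count
    return size_rgn, bounds, scans


def rects(scans):
    """Turn each pair of walls into a (left, top, right, bottom) rectangle."""
    result = []
    for y_top, y_bottom, walls in scans:
        for left, right in zip(walls[0::2], walls[1::2]):
            result.append((left, y_top, right, y_bottom))
    return result


def inside_surface(bounds, width, height):
    """Naive check: does the bounding box fit into (0, 0)-(width, height)?"""
    left, top, right, bottom = bounds
    return 0 <= left and 0 <= top and right <= width and bottom <= height


size_rgn, bounds, scans = decode_region(REGION_TAIL)
print("bounds:", bounds)
for rect in rects(scans):
    print("rect:  ", rect)
print("inside 1874x789 surface:", inside_surface(bounds, 1874, 789))
```

With the bytes above, the decoder consumes exactly five scans, and the 112 bytes of scans plus the 104 bytes between the start of REGION (…130) and the first scan (…198) add up to the 216 stored in the first field, so the assumed layout is at least self-consistent. The result is the rectangle from (-22, 79) to (1836, 781) with a hole at x 731..1307, y 510..582, and the naive containment test fails only because of the negative left edge.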
What exactly happened?
I can now reconstruct pretty accurately what happened in my game, I just cannot explain the error. Sorry for just presenting this “as is” and not explaining a lot, but I encounter this problem in a pre-existing program and we just have to take it the way it is.
1. My game starts.
It creates a dialog of 1860×782 pixels. (These dimensions were chosen by Windows as a default based on my monitor resolution.)
It creates five children for the dialog:
- The “Back” button.
- The title label.
- The “Info” button.
- The splash page (a dialog itself with children, e.g. for text labels).
- The main page (a dialog itself with children).
- (More pages for other UI stuff in my game, but this never activates before the bugcheck.)
All these children are created in the top-left corner of their parent’s client area. They are all 0×0 or 1×1 pixels small. The main window is invisible because I don’t want the user to see it building up.
(enable non-https content in your browser if you can’t see anything)
2. The main window is shown. This triggers a resize message to compute the layout.
All children go to the final coordinates you see, except the main page. (The main page is not visible yet, and so I don’t invest time in its layout and it just sticks around invisibly at (0, 0) with size 1×1 or so.)
The window is drawn for the first time.
3. I put a little twist on the splash screen: It normally starts a transition to the main menu automatically, except if there is an error during initialization (e.g. the XAudio debug library missing in debug builds). In this case, it’ll show the corresponding text and you must click “Accept” in the bottom-right corner to proceed anyway (e.g. without sound).
I got one dump with a release version, no game errors and automatic transition to main menu. I got the other dump with a debug version, one error and manual clicking “Accept”.
This tells us that the window is fully functional and interactive at this point.
4. Transition to the main menu starts.
The sub-dialog for the main menu is positioned at the right edge of the main window (via the trinity of the DeferWindowPos() API: BeginDeferWindowPos(), DeferWindowPos(), EndDeferWindowPos()). This is a resizing, so its layout gets re-computed. No pixels are visible (the sub-dialog is outside of its parent’s client area), but it now lurks just off the right edge of the window.
5. First animation frame.
The sub-dialog for the splash screen is moved 20–30 pixels to the left, and so is the sub-dialog for the main menu. Both moves happen atomically via the trinity of the DeferWindowPos() API.
The call to EndDeferWindowPos() hits the bugcheck and the system freezes until the core dump is written to disk. According to our analysis above, the kernel fails to copy the pixels of the old splash screen to its new position after scrolling 24 pixels or so.
What now?
So here we are. Does anyone have a suggestion on how to find the cause *or* how to create a minimal example that could be sent to Microsoft for investigation?
Please keep in mind, I have about one shot per week.