
Just a quick note that a detailed presentation about driver porting in CE6 is now available on Channel 9:

http://channel9.msdn.com/posts/TravisHobrla/Porting-Drivers-to-Windows-CE-60/

This presentation was developed by Juggs Ravalia and me and has been floating around technical conferences (like MEDC) for a couple of years.  Now it is finally available online!

Posted by: Russ Keldorph

In my previous post, I talked about how structure packing works.  Now I’d like to talk about when and why it’s commonly used, as well as why you may or may not want to use it.  Let me start out by saying that by "structure packing" I'm referring to the use of the /Zp compiler switch or the #pragma pack directive to make the packing of a structure something other than the default.  For example, using #pragma pack(2) around a structure containing an int changes that structure's packing from its default of 4 to 2.  By contrast, #pragma pack(1) around a structure containing only char (1-byte) types has no effect and is (technically) harmless.

Why use packing?

People usually use structure packing for one of two reasons:

1. they want to save space in data structures, or

2. they want to format a stream of bytes into fields according to some existing specification like a network protocol.

These can both be valid reasons, but, more often than not, the implications of a decision to use packing are not fully understood, leading to unforeseen side effects that can, in some cases, have long-term negative consequences.  The point of this post is to identify the costs of packing and suggest best practices around its use.

First, let’s look at a common example of how packing affects code generation.  Take the following C++ code compiled for all four architectures supported by Windows Embedded CE.

// To compile: cl -c -O2 t.cpp -DPACKING=<packing size>

#pragma pack(push, PACKING)

struct S {

    char i8;

    int i32;

};

#pragma pack(pop)

 

int extract(S * ps) {

    return ps->i32;

}

The following table lists the sequences of code required to load the i32 member of S.  Remember that when PACKING=4, padding is inserted such that the i32 member’s offset from the beginning of S is a multiple of its alignment (4).  When PACKING=1, i32’s alignment becomes 1, so no padding is inserted.

 

          PACKING=4                       PACKING=1

ARM       ldr     r0, [r0, #4]            ldrb    lr, [r0, #1]!
                                          ldrb    r3, [r0, #1]
                                          ldrb    r2, [r0, #2]
                                          ldrb    r1, [r0, #3]
                                          orr     r3, lr, r3, lsl #8
                                          orr     r3, r3, r2, lsl #16
                                          orr     r0, r3, r1, lsl #24

MIPS      lw      v0,4(a0)                addiu   t0,a0,1
                                          lwl     v0,3(t0)
                                          lwr     v0,0(t0)

SuperH    mov.l   @(4,r4),r0              add     #1,r4
                                          mov.b   @(3,r4),r0
                                          mov     r0,r3
                                          mov.b   @(2,r4),r0
                                          shll8   r3
                                          extu.b  r0,r2
                                          mov.b   @(1,r4),r0
                                          or      r3,r2
                                          extu.b  r0,r1
                                          mov.b   @r4,r0
                                          shll8   r2
                                          or      r2,r1
                                          shll8   r1
                                          extu.b  r0,r0
                                          or      r1,r0

x86       mov     eax,dword ptr [eax+4]   mov     eax,dword ptr [eax+1]

 

Notice how the difference packing makes depends a lot on the architecture you’re targeting.  For the RISC targets (ARM, MIPS, SH), the compiler must assume that the i32 member is misaligned and must generate special code since normal 4-byte load instructions do not work in that case.  In terms of code size, SuperH and ARM suffer the most since they have to load one byte at a time and combine them with a series of shifts and logical ORs.  MIPS is quite a bit better with its special “left” and “right” load instructions, and x86 isn’t affected at all since the CPU supports misaligned addresses for most memory accesses.  I don’t want to speculate too much, but it’s possible that the reason structure packing is so popular is that x86 is so popular.  If more people had to target SH-4, they’d think twice before packing their data types.  Oh, and one thing I should mention is that the 8-bit i8 member isn’t really necessary for this discussion.  Even if it were absent such that i32’s offset from S were zero (0), the generated code would be almost identical.  This is because packing works by modifying the alignment of members.  It’s the alignment of the member, not its offset, which determines how the compiler accesses it.
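As a rough illustration of the last point, the sketch below declares the example structure under both packing values, plus a variant with no i8 member at all. The names S4, S1, OnlyInt1 and demo_extract are hypothetical and exist only for this sketch, and the checks assume a typical compiler where int is 4 bytes with a natural alignment of 4.

```cpp
#include <cassert>
#include <cstddef>

#pragma pack(push, 4)
struct S4 {            // same members as S, compiled as if PACKING=4
    char i8;
    int  i32;
};
#pragma pack(pop)

#pragma pack(push, 1)
struct S1 {            // same members as S, compiled as if PACKING=1
    char i8;
    int  i32;
};
struct OnlyInt1 {      // no i8 member, so i32 sits at offset 0
    int i32;
};
#pragma pack(pop)

// Assumption: int is 4 bytes with natural alignment 4.
static_assert(offsetof(S4, i32) == 4, "padding is inserted after i8");
static_assert(offsetof(S1, i32) == 1, "no padding when packed to 1");
static_assert(sizeof(S4) == 8 && sizeof(S1) == 5, "sizes quoted in the post");
static_assert(alignof(S4) == 4 && alignof(S1) == 1, "packing caps the alignment");
static_assert(alignof(OnlyInt1) == 1, "offset 0, but the alignment is still 1");

// Access through the struct type lets the compiler pick a misalignment-safe load.
inline int extract(const S1* ps) { return ps->i32; }

// Small usage check.
inline void demo_extract() {
    S1 s = {'x', 42};
    assert(extract(&s) == 42);
}
```

OnlyInt1 is the interesting case: its only member is at offset zero, yet the compiler must still treat it as potentially misaligned because the structure's alignment has dropped to 1.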

Saving space

Let’s now take a look at the first reason you might want to use structure packing: to save space.  It’s true that the structure above with PACKING=1 is smaller than the structure with PACKING=4.  The sizeof operator indicates 5 bytes for the former and 8 bytes for the latter.  This might lead one to believe that all data should be packed.  However, if you look at the impact on code size, the benefit is not so obvious.  The code required for each access to misaligned data can be much larger than for a normal access, and that cost is multiplied by the total number of accesses across the code base.  In one case I know of, a colleague removed a #pragma pack(1) from the main header of his ARM DLL, reducing its size from 300kB to 200kB.  Remember that data is often transient, i.e. it comes and goes, and space for it isn’t always allocated.  However, code will usually live for the entire lifetime of a process, and can also take up space indefinitely in ROM or on disk.

In short, make sure you take into account the code size implications if you think packing will save space.  Make sure you know the performance impact as well.  It should come as no surprise that the ARM and SuperH sequences for misaligned accesses are slower than the aligned sequences.  However, even the x86 sequence is usually slower if the memory is misaligned, because modern CPUs have to access both of the enclosing (aligned) words in order to access a misaligned word.

Recommendation: Instead of packing to save space, consider reordering your data structures so that larger members always precede smaller members (or, rather, more-aligned members precede less-aligned members).  That way, you will have little or no padding except possibly at the end of the structure.  Padding at the end of a structure affects array allocations, but little else.
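To make the recommendation concrete, here is a minimal sketch comparing a poorly ordered structure, the same members reordered from most-aligned to least-aligned, and the packed version. The structure and member names are made up for illustration, and the sizes assume a typical compiler where the fixed-width types have their natural alignment.

```cpp
#include <cassert>
#include <cstdint>

struct Unordered {           // hypothetical example type
    std::uint8_t  flag;      // offset 0, followed by 3 bytes of padding
    std::uint32_t count;     // offset 4
    std::uint8_t  kind;      // offset 8, followed by 1 byte of padding
    std::uint16_t id;        // offset 10
};

struct Reordered {           // same members, most-aligned first
    std::uint32_t count;     // offset 0
    std::uint16_t id;        // offset 4
    std::uint8_t  flag;      // offset 6
    std::uint8_t  kind;      // offset 7
};

#pragma pack(push, 1)
struct Packed {              // original order, packed to 1
    std::uint8_t  flag;
    std::uint32_t count;
    std::uint8_t  kind;
    std::uint16_t id;
};
#pragma pack(pop)

static_assert(sizeof(Unordered) == 12, "4 bytes lost to padding");
static_assert(sizeof(Reordered) == 8,  "no padding needed");
static_assert(sizeof(Packed) == 8,     "same size as the reordered version");
static_assert(alignof(Reordered) == 4, "members keep their natural alignment");
static_assert(alignof(Packed) == 1,    "every access must assume misalignment");

// Small usage check: the reordered struct behaves like any ordinary struct.
inline void demo_reordered() {
    Reordered r = {1000u, 7, 1, 2};
    assert(r.count == 1000u && r.id == 7 && r.flag == 1 && r.kind == 2);
}
```

In this example reordering recovers all of the space that packing would, while every member stays naturally aligned and can be accessed with ordinary loads and stores.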

Matching byte stream formats

The other common reason people use packing is to implement network protocols or to parse byte streams.  Packing can make it more convenient to write code for certain data formats.  Take this (made up) packet format as an example:

 

Signature    Size        Protocol    Checksum    Payload
(16-bit)     (32-bit)    (16-bit)    (32-bit)    (N-bit)

 

If we were to declare the structure like this:

struct packet1 {

      unsigned short signature;  // offset 0

      unsigned long size;        // offset 2 or 4?

      unsigned short protocol;   // offset 6 or 8?

      unsigned long checksum;    // offset 8 or 10 or 12?

      unsigned char payload[1];

};

by default, the compiler will insert padding between the signature and size fields in order to maintain the latter’s alignment (4).  One solution to this would be to use #pragma pack(2), which would remove the need for padding.  In some cases, this might be the right thing to do, particularly if the alignment of the beginning of the packet is at most 2-byte.  But wait, as you may have noticed, the offset of the checksum member is a multiple of its natural alignment.  That means that if the beginning of the structure is aligned, it can be accessed safely with a normal 4-byte load or store.  However, if we use #pragma pack(2), the alignment of all fields is capped at 2-byte, forcing the compiler to load it with at least two instructions for most architectures.

What if we can ensure that the beginning of our packet buffer will always be 4-byte aligned?  Is it possible to match the packet format while still loading all fields as efficiently as possible?  Yes, if you’re willing to write a little more code.  One option is to replace the size field with two smaller fields with less strict alignment requirements:

struct packet2 {

      unsigned short signature;  // offset 0

      unsigned short sizeLow;    // offset 2

      unsigned short sizeHigh;   // offset 4

      unsigned short protocol;   // offset 6

      unsigned long checksum;    // offset 8

      unsigned char payload[1];

};

Now we have what we want in terms of layout.  In fact, this is (or is similar to) what we would have to write if we didn’t have the ability to pack structures at all.  The problem is that now we have to write extra code to get at the size member, and avoiding that extra code is the main reason we wanted to use packing in the first place.  The key to fixing this is to realize that we just need to reduce the alignment requirement of the size member.  How?  One option is to use #pragma pack.

#pragma pack(push,2)

struct u32_a16 {

      unsigned long u32;

};

#pragma pack(pop)

struct packet3 {

      unsigned short signature;  // offset 0

      struct u32_a16 size;       // offset 2

      unsigned short protocol;   // offset 6

      unsigned long checksum;    // offset 8

      unsigned char payload[1];

};

Note that we have to encapsulate the scalar unsigned long type in a structure because #pragma pack doesn’t affect scalars that are not members of a structure.  The one drawback to this is that, in C, we have to write a little extra code to access the size member, i.e. we’d have to write p->size.u32 instead of just p->size.  You could perhaps hide this overhead in an accessor function.  In C++, however, you can add a little syntactic sugar to make the code look just like we want:

#pragma pack(push,2)

struct u32_a16 {

      inline unsigned long operator=(const unsigned long &that) {

            return this->u32 = that;

      }

      inline operator unsigned long() const { return u32; }

      unsigned long u32;

};

#pragma pack(pop)

Now the compiler can generate the most efficient code for aligned fields and correct code for the misaligned ones.  Remember, though, if the entire structure may not be aligned, you’re probably best off packing the whole thing since the compiler needs to generate unaligned access code for everything anyway.
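One way to test the layout assumptions rather than trust them is to let the compiler check the offsets. The sketch below restates u32_a16 and packet3 with one deliberate substitution: it uses std::uint16_t and std::uint32_t in place of unsigned short and unsigned long, so that the field widths also hold on compilers where long is 8 bytes. The helper functions read_size and demo_packet3 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#pragma pack(push, 2)
struct u32_a16 {                       // 32-bit value with 2-byte alignment
    std::uint32_t operator=(std::uint32_t that) { return u32 = that; }
    operator std::uint32_t() const { return u32; }
    std::uint32_t u32;
};
#pragma pack(pop)

struct packet3 {
    std::uint16_t signature;           // offset 0
    u32_a16       size;                // offset 2
    std::uint16_t protocol;            // offset 6
    std::uint32_t checksum;            // offset 8
    unsigned char payload[1];
};

// Compile-time checks that the struct matches the packet format.
static_assert(alignof(u32_a16) == 2, "only the size field has reduced alignment");
static_assert(offsetof(packet3, size) == 2, "no padding after signature");
static_assert(offsetof(packet3, protocol) == 6, "protocol follows size");
static_assert(offsetof(packet3, checksum) == 8, "checksum is naturally aligned");
static_assert(alignof(packet3) == 4, "the packet as a whole stays 4-byte aligned");

// The conversion operator lets callers treat size like a plain integer.
inline std::uint32_t read_size(const packet3& p) { return p.size; }

// Small usage check.
inline void demo_packet3() {
    packet3 p = {};
    p.size = 1500u;                    // uses u32_a16::operator=
    assert(read_size(p) == 1500u);
}
```

If someone later changes a field or a packing value, the static_assert lines fail at compile time instead of producing a silently mismatched packet.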

Other tips about packing and alignment

·         Be careful when taking the address of a field in a packed structure.   If you assign it to a “normal” pointer, the compiler will lose the fact that it is misaligned.  For example:

 

struct S sample;        // struct from above

int * pi = &sample.i32; // alignment information lost

*pi = 4;                // DATATYPE_MISALIGNMENT exception

 

This can be particularly confusing when including an unpacked structure inside a packed structure.  The compiler has a warning (C4366) to attempt to detect this practice, but it’s not completely reliable. 

·         Try to avoid using packing in public interfaces that have (or will have) backward compatibility requirements.  Even though packing may seem beneficial now, it’s likely that it could be harmful in the future, particularly if the interface is implemented on a different architecture.  It's ok to use #pragma pack in a header file to protect it from other users (see below), but the packing value should be the compiler default (8).

·         If you must use #pragma pack in a header file, be careful not to let it “leak” out and affect structures you never intended.  Always use the push/pop features like you see above, and try to limit the packing scopes to just around the structures you care about.  The latter practice helps avoid someone unintentionally creating packed structures when adding types to your header.

·         Be very wary of One Definition Rule (ODR) violations with packing.  Defining the same type under different packing values in different translation units can lead to bugs that are very difficult to track down.

o   Always define your types in a single header and include that wherever you need it.

o   Don’t #include headers under #pragma pack

o   Use #pragma pack(push,8) at the beginning and #pragma pack(pop) at the end of your headers to protect them from /Zp switches and other people including them under #pragma pack
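For the first tip above (taking the address of a packed field), one simple way to stay safe is to copy the field into an ordinary local variable, pass the address of that local, and copy the result back. This is only a sketch: PackedS, add_one and bump are hypothetical names, and add_one stands in for any function that expects a normally aligned int pointer.

```cpp
#include <cassert>

#pragma pack(push, 1)
struct PackedS {        // hypothetical packed struct, same shape as S above
    char i8;
    int  i32;           // may be misaligned in memory
};
#pragma pack(pop)

// Stand-in for any routine that assumes a naturally aligned int*.
inline void add_one(int* p) { *p += 1; }

// Safe pattern: never hand out the address of the packed member itself.
inline void bump(PackedS& s) {
    int tmp = s.i32;    // access through the struct: misalignment-safe load
    add_one(&tmp);      // tmp is an ordinary, aligned local
    s.i32 = tmp;        // access through the struct: misalignment-safe store
}

// Small usage check.
inline void demo_bump() {
    PackedS s = {'a', 41};
    bump(s);
    assert(s.i32 == 42);
}
```

The extra copy costs a few instructions, but the alignment information never gets lost because the packed member is only ever touched through its structure type.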

Conclusions

Packing can be a useful feature, but like many useful features it needs to be understood fully in order to avoid misuse.  Always test your assumptions about packing before making a decision to use it.  “Premature optimization is the root of all evil.”

As always, feel free to ask questions.  I hope my next post will come sooner than this one did. :)

 

When I was a developer, and customer, using MSDN in my day-to-day work, I occasionally found myself frustrated by document discoverability. MSDN often had the information I was looking for -- sometimes in multiple formats -- but finding just what you want in MSDN can be quite a task.

We're working to improve this situation for a number of critical scenarios, including device bring-up. One important task for board support package (BSP) developers is porting a BSP from a previous version of Embedded CE to Embedded CE 6.0. Luckily, BSP porting information exists in a number of places.

First, the MSDN Library contains information on porting BSPs, starting at: http://msdn.microsoft.com/en-us/library/aa917748.aspx. You can also find information on porting device drivers, another key device bring-up task, at: http://msdn.microsoft.com/en-us/library/aa931071.aspx.

Channel 9 has an excellent talk by Travis Hobrla of the Embedded CE team on the process of porting a BSP from Embedded CE 5.0 to CE 6.0:  http://channel9.msdn.com/posts/mikehall/Porting-a-CE-50-BSP-to-CE-60-Travis-Hobrla/.

Doug Boling gave a great presentation on the new CE 6.0 kernel which includes porting information at MEDC 2006; you can find that presentation here: http://download.microsoft.com/documents/australia/medc2006/Windows_CE6_Architecture_Boling.ppt.

Please let us know if there are other crucial scenarios about which you're trying to find information! 

I've been working as a technical author and editor since 1995.  I've worked at Microsoft as a Programming Writer since 2006.  I joined the Embedded CE and Windows Mobile developer documentation team at the beginning of 2008, where I've worked on documentation for file systems and storage, the kernel, device bring-up, power management, and other Core OS functionality.

In this blog, I'll discuss the Embedded CE and Windows Mobile developer documentation (on MSDN and elsewhere) as it pertains to these areas.  Any feedback on our developer docs is welcome, and appreciated! :)

Posted by: Sue Loh

Hello out there, it's been a long time since I posted anything real, and I feel sorry about that.  As I began writing this article, I had just come from the first day of TechEd where I saw my colleagues present about CE6 and drivers, and was reminded of a subject I was suddenly inspired to write up for you all.  Today is now the last day of TechEd and I'm back home, but my comments still apply.

I'll let you in on something - not so much of a secret.  We all make mistakes.  And this is a blog post about one of my own.  You may have already read about the marshalling APIs on this blog, or otherwise learned of them.  When we designed these APIs, we planned them to hide away complexity in the decisions we made for performance and security reasons - so that OEMs and driver writers would not have to thread a maze of difficult details.  With that in mind, consider the CeAllocAsynchronousBuffer API.  The purpose of this API is to marshal a buffer into a driver's (or server's or service's) process space such that the driver/server/service could access the buffer asynchronously.  The work required to do the marshalling depends on the circumstances.  In kernel mode it probably just needs to be aliased (VirtualCopied) into the kernel, while in user mode it must be duplicated (memcpy'd).  The work also depends on what work CeOpenCallerBuffer might have done beforehand - for example if it is already duplicated into the process.  So, CeAllocAsynchronousBuffer hides all of these details.  You can call it and trust the API to make the right choices for security and perf.  We designed it to hide these details while asking the caller to make no assumptions about what's going on underneath.  Use CeFlushAsynchronousBuffer to guarantee changes have been written back, and CeFreeAsynchronousBuffer to do that plus release any resources.

So that's all well and good.  Enter older ARM CPUs and their virtually-tagged caches.  In the early days of CE6, we hadn't quite come to terms with how to prevent the cache coherency problems you could get if you aliased/VirtualCopied memory.  In later days, we fixed aliasing so that it would make both source and dest buffer uncached for the duration of the alias.  (Specifically, we fixed VirtualAllocCopyEx, NOT VirtualCopy, since I am a stickler for little details.)  But in the early days, when we built the marshalling APIs, we were concerned about cache coherency.  So at that time, in CeAllocAsynchronousBuffer we made ARM virtually-tagged CPUs duplicate the memory instead of aliasing it.  This, of course, left us greatly concerned about performance, and we knew we'd ship a lot of ARM virtually-tagged devices.  So we added MARSHAL_FORCE_ALIAS with the expectation that callers would use it with caution, and deal with cache coherency problems themselves.  That, at least, could probably win some performance on large buffers, even if it did cost complexity.

Later, we got our heads on right and fixed aliasing to leave memory uncached.  So duplication was no longer as important.  But we also made a discovery -- on small buffers, duplication was *faster* than aliasing!  We did some benchmarking and decided that for buffers below 16KB, we'd duplicate, while on larger buffers we'd alias.  But we'd only benchmarked ARM virtually-tagged devices, and so we left the code similar to its original state.  Meaning that we only made the aliasing vs. duplication decision based on size on ARM virtually-tagged devices.  For all other cases, CeAllocAsynchronousBuffer usually aliased.

At that point, in my opinion, we should have removed the MARSHAL_FORCE_ALIAS flag.  Instead, we left it, and now we're in a state where it confuses people.  At TechEd I saw my colleagues recommend it to driver developers for performance reasons - when in my opinion it should never be used.  Let the OS make the decision what's best for performance.  The only case where we don't alias is for small buffers on ARM virtually-tagged caches, where we've demonstrated that duplication is faster than aliasing.  I think it's safe to say, you can look forward to this getting cleaned up in the future.  But remember, my recommendation remains: don't (blindly) use MARSHAL_FORCE_ALIAS!  It won't break anything, but you'll potentially be forcing the wrong thing for performance.
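To summarize the alias-versus-duplicate rule described above, here is a toy model of the decision. It is emphatically not the OS implementation: MarshalStrategy, kDuplicateThreshold and choose_strategy are hypothetical names, and the model covers only the size-based choice for kernel-mode callers, ignoring user mode and whatever CeOpenCallerBuffer may already have done.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; this only models the rule as described in the post.
enum class MarshalStrategy { Alias, Duplicate };

// "For buffers below 16KB, we'd duplicate, while on larger buffers we'd alias."
constexpr std::size_t kDuplicateThreshold = 16 * 1024;

constexpr MarshalStrategy choose_strategy(std::size_t bytes,
                                          bool arm_virtually_tagged_cache) {
    // The size-based choice was only applied on ARM virtually-tagged devices;
    // in the other cases the post says the API usually aliased.
    return (arm_virtually_tagged_cache && bytes < kDuplicateThreshold)
               ? MarshalStrategy::Duplicate
               : MarshalStrategy::Alias;
}

// Small usage check.
inline void demo_strategy() {
    assert(choose_strategy(512, true) == MarshalStrategy::Duplicate);
    assert(choose_strategy(64 * 1024, true) == MarshalStrategy::Alias);
    assert(choose_strategy(512, false) == MarshalStrategy::Alias);
}
```

Seen this way, forcing an alias can only change the outcome for small buffers on ARM virtually-tagged devices, which is exactly the case where duplication was measured to be faster.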

 

Hi, I'm Chaitanya Raje, a developer on the Compiler and Tools team for Windows Mobile and Windows Embedded CE. This is my first blog post on MSDN. I hope to share some insights into new features and commonly encountered issues with the compilers and related tools through my posts.

 

I would like to start with a write-up on dynamic initialization of variables in C++. C++ (but not C) allows you to initialize global variables with non-constant initializers. For example:

 

Foo.cpp

#include <stdio.h>

int alpha(void)

{

    return 20;

}

 

int i = alpha(); //dynamic initialization

 

int main()

{

    printf("i = %d",i);

    return i;

}

 

According to the C and C++ standards, global variables should be initialized before main() is entered. In the above program, the variable 'i' should be initialized with the return value of the function alpha(). Since the return value is not known until the program is actually executed, this is called dynamic initialization of the variable.

 

The Problem:

Let us compile the above program and link with entrypoint ‘main’.

 

cl Foo.cpp /link /entry:main

 

Here’s your output when you run the exe –

 

i = 0

 

Surprised? We all expected the output to be "i = 20". Let us try to understand why we got an unexpected output.

 

The Theory:

The global ‘i’ has a dynamic initializer, so it does not receive its value until code runs at program startup. Since we linked the exe with ‘main’ as the entry point, execution began directly at ‘main()’, bypassing the C Runtime startup code that would normally run first. ‘alpha()’ was never invoked and ‘i’ was never initialized, hence the unexpected output.

 

Now the question is how do we invoke these dynamic initializers before ‘main()’ and still keep the entry point of our program as ‘main()’?

 

The Solution:

The answer lies in C Runtime's startup routines. C Runtime (CRT) defines different startup routines corresponding to your standard entry points as follows –

 

Your entrypoint     CRT entrypoint

main                mainCRTStartup
wmain               wmainCRTStartup
WinMain             WinMainCRTStartup
wWinMain            wWinMainCRTStartup
DllMain             _DllMainCRTStartup

 

The above CRT startup routines are designed to invoke the dynamic initializers in your program to initialize the global variables and then call the corresponding standard entry point. So, if your program uses dynamic initializers, you should set your entry point to one of the CRT startup routines (the one corresponding to your real entry point in the table above) when linking. If you use a standard entry point such as main directly as the linker entry point instead of the CRT startup routine, global variables that need dynamic initialization will be left uninitialized.

 

Now let’s compile and link Foo.cpp with CRT entrypoint –

 

cl Foo.cpp /link /entry:mainCRTStartup

 

Here’s your output as expected –

 

i = 20

 

NOTE: The above program will generate a compiler error if compiled as a C program (instead of C++) because dynamic initializers are not allowed by the C language.

 

Here are a few more examples of dynamic initializers-

 

1.

class B {

public:

    int i;

    B() {

        i=10;

    }

    ~B() {}

};

B b; //requires dynamic initializer to call constructor B().

 

A global object is a classic example of something that needs a dynamic initializer. The constructor of a global object needs to be invoked before we enter main.

 

2.

extern char ValueKnown[];

char* Name1 = ValueKnown; //statically initialized with &ValueKnown[0]

#if defined(__cplusplus)

    extern char* ValueUnknown;

    char* Name2 = ValueUnknown; // requires dynamic initializer

#endif

 

ValueKnown and ValueUnknown look very similar, but there’s a subtle difference between them. ValueKnown is a statically initialized array, so its location is guaranteed to be known when linking with the module in which it is defined (it lives in that module's .data section). ValueUnknown, on the other hand, is a char pointer variable whose value may or may not be known at compile time or when linking with the module that defines it. It could be pointing to a constant string, or it could have a dynamic initializer itself (in the module defining it). This makes the compiler generate a dynamic initializer for the variable Name2.

 

 

More details:

Some of you might be curious to know how CRT finds information about dynamic initializers.  The compiler actually sets up things for the CRT. It creates a section named .CRT$XCU in your object file with useful information for the CRT. This section is essentially a list of function pointers or pointers to class constructors which are dynamic initializers for your program. The CRT just loops through this list and invokes them as it goes along. The compiler generates an entry into this section every time it finds a dynamic initializer in your code.

 

The section name is .CRT, and XCU is the name of the group.

 

The CRT also defines 2 pointers

- __xc_a in section .CRT$XCA

- __xc_z in section .CRT$XCZ

 

The linker then merges all .CRT groups into one section and orders them alphabetically by group name. This causes the pointers to be laid out as follows -

 

.CRT$XCA

            __xc_a

.CRT$XCU

            Pointer to Global Initializer 1

            Pointer to Global Initializer 2

.CRT$XCZ

            __xc_z

 

__xc_a and __xc_z thus mark the start and end of the dynamic initializer list. The CRT can now loop through this list at startup. Note that the order of initialization across modules is neither defined nor easily predictable.
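The walk over the initializer list can be modeled in portable C++ as a loop over a table of function pointers. This is a simplified sketch, not the CRT's source: InitFn, run_initializers, g_table and the init_* functions are hypothetical stand-ins, with an ordinary array playing the role of the merged .CRT section and its first and last slots playing the roles of __xc_a and __xc_z.

```cpp
#include <cassert>

typedef void (*InitFn)();            // hypothetical alias for an initializer

// Calls every non-null entry in [first, last), in table order.
inline void run_initializers(InitFn* first, InitFn* last) {
    for (InitFn* p = first; p != last; ++p) {
        if (*p) {
            (*p)();
        }
    }
}

// Stand-ins for two globals that need dynamic initialization.
static int g_i = 0;
static int g_j = 0;
static void init_i() { g_i = 20; }
static void init_j() { g_j = g_i + 1; }

// Stand-in for the merged section: start marker slot, entries, end marker slot.
static InitFn g_table[] = {
    nullptr,   // plays the role of __xc_a (.CRT$XCA)
    init_i,    // Pointer to Global Initializer 1 (.CRT$XCU)
    init_j,    // Pointer to Global Initializer 2 (.CRT$XCU)
    nullptr    // plays the role of __xc_z (.CRT$XCZ)
};

// What a startup routine would do before calling main().
inline void startup_demo() {
    run_initializers(&g_table[0], &g_table[3]);
    assert(g_i == 20);
    assert(g_j == 21);
}
```

If nothing ever calls run_initializers, g_i stays 0, which mirrors the "i = 0" output seen earlier when the CRT startup routine was bypassed.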

 

 

 

I hope this has given you some insight into the C Runtime's initialization mechanism, but the real point I wanted to convey in this post is this: try to use the CRT entry points (for example, mainCRTStartup) as your linker entry point instead of the standard main/WinMain, to avoid surprises in your output.

 

If you have any questions or comments regarding this topic, please let us know. We'll be more than happy to answer them! If you would like us to write about any particular topic related to compilers and related tools like the linker, runtime libraries, etc., we are open to recommendations.

 

Thanks.

 

Chaitanya Raje

on behalf of  Windows Devices Compiler Team

New sample code called the BSP Template is now available for download.  This code serves two major purposes:

1. Provide a stub version of a BSP that illustrates all required and optional BSP functions.

2. Educate newcomers to CE on the basics of BSPs in an incremental fashion.

The BSP Template is compatible with CE6.0 and CE6R2.  You can find it attached to this post.