This is the tenth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you'll want to find a nice shady tree, relax and meditate on why your world is so confused and full of meaningless tasks that it has kept you from pursuing the perfection of writing IQueryable providers.
Complete list of posts in the Building an IQueryable Provider series
Last time I blamed the television writers strike for delaying me in getting out my next installment. This time I blame the lack of one, and sunny days, and my son riding his bicycle, and the grueling, tiresome task of getting paid. Would you believe some of the code we are writing has nothing what-so-ever to do with IQueryable LINQ providers? Crazy, I know. Maybe it would be a different story around here if we had a few more shady trees.
Fortunately for you I have a prime piece of savory source code ready for slow grilling over a bed of hot mesquite. It's something I'd like to say I was saving for a special occasion, but the truth is I've been putting it off because I thought the code would be just too overwhelming. I mean for me, not you. You see I've been trying to do two things with this series; one, provide a guide to help developers navigate and understand the power and flexibility of the IQuerayble interface, and two, attempt to prove to myself that code written in a purely functional style (or as pure as I can make it) can end up much cleaner and easier on the eyes. I've been horrified ever since tackling OrderBy that this subject would become my undoing. Knowing how involved the translation was for LINQ to SQL I was dreading the immutable madness that would ensue. Obviously, I survived the ordeal and so will you.
Grappling with GroupBy and Aggregates
Translating just GroupBy itself might not be so difficult if I did not have to also account for aggregates. It seems that GroupBy is not very interesting without the ability to write expressions involving Count() and Max() and Sum(), and that's where the difficulty sets in.
A query involving groups and aggregates that looks like this:
var query = from o in db.Orders
group o by o.CustomerID into g
select g.Max(o => o.OrderID);
Is translated into a series of method calls that looks like this:
var query = db.Orders.GroupBy(o => o.CustomerID).Select(g => g.Max(o => o.OrderID));
And this is a problem because as I translate from the bottom up (as I have been doing all along) building individual SELECT subqueries for each query operator, I won’t even discover the use of the aggregate ‘Max’ until after I’ve created the SELECT with the GROUP BY. So how do I get the two pieces together in the same place at the same time without violating my functional style and immutable expression nodes? Weirder yet, how do I even know that a call to an aggregate method should be tied back to a particular GroupBy? What if I had two group-bys?
So where do I start to explain? Maybe the easy stuff first. I’ve added a GroupBy property to SelectExpression.
internal class SelectExpression : Expression {
...
ReadOnlyCollection<Expression> groupBy;
internal SelectExpression(
...,
IEnumerable<Expression> groupBy
)
internal ReadOnlyCollection<Expression> GroupBy {
get { return this.groupBy; }
}
}
Now every place where I construct one I have to specify a collection of grouping expressions (or null). GroupBy expressions also get visited after everything else (so far). Here’s the new version of VisitSelect in DbExpressionVisitor.
protected virtual Expression VisitSelect(SelectExpression select) {
Expression from = this.VisitSource(select.From);
Expression where = this.Visit(select.Where);
ReadOnlyCollection<ColumnDeclaration> columns = this.VisitColumnDeclarations(select.Columns);
ReadOnlyCollection<OrderExpression> orderBy = this.VisitOrderBy(select.OrderBy);
ReadOnlyCollection<Expression> groupBy = this.VisitExpressionList(select.GroupBy);
if (from != select.From
|| where != select.Where
|| columns != select.Columns
|| orderBy != select.OrderBy
|| groupBy != select.GroupBy
) {
return new SelectExpression(select.Type, select.Alias, columns, from, where, orderBy, groupBy);
}
return select;
}
I also added an AggregateExpression class. It represents a call to an aggregate operator.
internal enum AggregateType {
Count,
Min,
Max,
Sum,
Average
}
internal class AggregateExpression : Expression {
AggregateType aggType;
Expression argument;
internal AggregateExpression(Type type, AggregateType aggType, Expression argument)
: base((ExpressionType)DbExpressionType.Aggregate, type) {
this.aggType = aggType;
this.argument = argument;
}
internal AggregateType AggregateType {
get { return this.aggType; }
}
internal Expression Argument {
get { return this.argument; }
}
}
Now I at least know what to turn those aggregates into. But how do I get these aggregate expressions into the right place in the query tree? What if I did nothing and just let the aggregates fall where they may? What would it look like? Would it even be legal SQL?
SELECT MAX(t5.OrderID)
FROM (
SELECT t0.CustomerID
FROM Orders AS t0
GROUP BY t0.CustomerID
) AS t5
Oh, no. Danger Will Robinson! This is not legal SQL. Where does OrderID even come from? It’s not even projected out of the sub-query with the GROUP BY. And even if it were somehow projected, the result of the query would be all wrong. It would give me a single maximum instead of one for each group. Getting group-by & aggregates to work together is going to be tricky.
Is it even possible to solve it and maintain my layering approach? What about correlated sub-queries? I could form a sub-query that joins back to the original query based on the grouping expressions and have it contain the aggregate.
SELECT (
SELECT MAX(t2.OrderID)
FROM Orders AS t2
WHERE ((t2.CustomerID IS NULL AND t5.CustomerID IS NULL) OR (t2.CustomerID = t5.CustomerID))
) AS c0
FROM (
SELECT t0.CustomerID
FROM Orders AS t0
GROUP BY t0.CustomerID
) AS t5
Now that at least executes on the server. It might not be pretty and likely not efficient unless the server is really good at unscrambling my query, but it does execute and produce the correct results. So it is technically legal. But surely I can do better!
The GroupBy Operator
By now you are thinking, OMG, he’s going to stick us with a wacky solution like he did with nested queries! Never fear, my friend. Do not look at the man behind the current. Cast your gaze ahead and all will be made clear. In my hands, I have the GroupBy operator. Watch as it transforms into a SQL query inside the QueryBinder!
protected override Expression VisitMethodCall(MethodCallExpression m) {
if (m.Method.DeclaringType == typeof(Queryable) ||
m.Method.DeclaringType == typeof(Enumerable)) {
switch (m.Method.Name) {
...
case "GroupBy":
if (m.Arguments.Count == 2) {
return this.BindGroupBy(
m.Arguments[0],
(LambdaExpression)StripQuotes(m.Arguments[1]),
null,
null
);
}
else if (m.Arguments.Count == 3) {
return this.BindGroupBy(
m.Arguments[0],
(LambdaExpression)StripQuotes(m.Arguments[1]),
(LambdaExpression)StripQuotes(m.Arguments[2]),
null
);
}
else if (m.Arguments.Count == 4) {
return this.BindGroupBy(
m.Arguments[0],
(LambdaExpression)StripQuotes(m.Arguments[1]),
(LambdaExpression)StripQuotes(m.Arguments[2]),
(LambdaExpression)StripQuotes(m.Arguments[3])
);
}
break;
...
}
}
return base.VisitMethodCall(m);
}
As you can see, I’m handling three different variations of GroupBy. The first one simply takes a single keySelector lambda (the thing to group by). It is supposed to return a sequence of grouping’s that each have a group key value and a collection of elements that make up that group. The second form is the same as the first except it also includes an elementSelector lambda that allows me to specify a map between an item in the source sequence and its shape of the elements in the group. The last form includes a resultSelector lambda that allows me to specify what the key and group collection turn into, instead of assuming they always form into instances of IGrouping<K,E>. With all these variations, the translation of group-by is going to get a bit involved, so bear with me.
Get ready. Get set. Open your eyes. Close them. Now open them again and really look. Go!
protected virtual Expression BindGroupBy(Expression source, LambdaExpression keySelector, LambdaExpression elementSelector, LambdaExpression resultSelector) {
ProjectionExpression projection = this.VisitSequence(source);
this.map[keySelector.Parameters[0]] = projection.Projector;
Expression keyExpr = this.Visit(keySelector.Body);
Expression elemExpr = projection.Projector;
if (elementSelector != null) {
this.map[elementSelector.Parameters[0]] = projection.Projector;
elemExpr = this.Visit(elementSelector.Body);
}
// Use ProjectColumns to get group-by expressions from key expression
ProjectedColumns keyProjection = this.ProjectColumns(keyExpr, projection.Source.Alias, projection.Source.Alias);
IEnumerable<Expression> groupExprs = keyProjection.Columns.Select(c => c.Expression);
// make duplicate of source query as basis of element subquery by visiting the source again
ProjectionExpression subqueryBasis = this.VisitSequence(source);
// recompute key columns for group expressions relative to subquery (need these for doing the correlation predicate)
this.map[keySelector.Parameters[0]] = subqueryBasis.Projector;
Expression subqueryKey = this.Visit(keySelector.Body);
// use same projection trick to get group-by expressions based on subquery
ProjectedColumns subqueryKeyPC = this.ProjectColumns(subqueryKey, subqueryBasis.Source.Alias, subqueryBasis.Source.Alias);
IEnumerable<Expression> subqueryGroupExprs = subqueryKeyPC.Columns.Select(c => c.Expression);
Expression subqueryCorrelation = this.BuildPredicateWithNullsEqual(subqueryGroupExprs, groupExprs);
// compute element based on duplicated subquery
Expression subqueryElemExpr = subqueryBasis.Projector;
if (elementSelector != null) {
this.map[elementSelector.Parameters[0]] = subqueryBasis.Projector;
subqueryElemExpr = this.Visit(elementSelector.Body);
}
// build subquery that projects the desired element
string elementAlias = this.GetNextAlias();
ProjectedColumns elementPC = this.ProjectColumns(subqueryElemExpr, elementAlias, subqueryBasis.Source.Alias);
ProjectionExpression elementSubquery =
new ProjectionExpression(
new SelectExpression(TypeSystem.GetSequenceType(subqueryElemExpr.Type), elementAlias, elementPC.Columns, subqueryBasis.Source, subqueryCorrelation),
elementPC.Projector
);
string alias = this.GetNextAlias();
// make it possible to tie aggregates back to this group-by
GroupByInfo info = new GroupByInfo(alias, elemExpr);
this.groupByMap.Add(elementSubquery, info);
Expression resultExpr;
if (resultSelector != null) {
Expression saveGroupElement = this.currentGroupElement;
this.currentGroupElement = elementSubquery;
// compute result expression based on key & element-subquery
this.map[resultSelector.Parameters[0]] = keyProjection.Projector;
this.map[resultSelector.Parameters[1]] = elementSubquery;
resultExpr = this.Visit(resultSelector.Body);
this.currentGroupElement = saveGroupElement;
}
else {
// result must be IGrouping<K,E>
resultExpr = Expression.New(
typeof(Grouping<,>).MakeGenericType(keyExpr.Type, subqueryElemExpr.Type).GetConstructors()[0],
new Expression[] { keyExpr, elementSubquery }
);
}
ProjectedColumns pc = this.ProjectColumns(resultExpr, alias, projection.Source.Alias);
// make it possible to tie aggregates back to this group-by
Expression projectedElementSubquery = ((NewExpression)pc.Projector).Arguments[1];
this.groupByMap.Add(projectedElementSubquery, info);
return new ProjectionExpression(
new SelectExpression(TypeSystem.GetSequenceType(resultExpr.Type), alias, pc.Columns, projection.Source, null, null, groupExprs),
pc.Projector
);
}
It starts off easy, following the same binding pattern as the other operators. First I bind the key selector and get a key expression. Then if I have an element selector I bind that too, to get an element expression, otherwise I use the projector expression from the incoming source as the element expression. Got it?
Now I have to figure out what the grouping expressions are going to be. You might be thinking I already know what they are, and there’s only one of them. It’s the key expression, right? Well, not exactly. It turns out its common to need to group by more than one property or column of data. The only way to do that with LINQ is to construct an anonymous type with multiple parts. So a key might be an anonymous type with multiple interesting fields. How do I turn that into one or more expression that can be evaluated as part of a SQL group by clause?
It turns out I already have a handy function that does what I want. If I treat the key expression they same way I treat a projection expression, I can use the ColumnProjector to turn the expression into a list of column declarations, each with its own scalar expression. Then I can simply throw away the column declarations and keep the expressions that define the columns and voila I have grouping expressions! Don’t try this at home. Err, I mean, go ahead and try this at home. It’s easy. Watch, I’ll do it again in a moment.
Moving along we come to some code trying to compute a ‘basis’. What’s that? Well, it turns out SQL doesn’t really support a GroupBy operator the way it is defined in LINQ. The LINQ operator, by default, produces a sequence of groups with each group containing the elements that got designated as part of that group. SQL group-by can only produce aggregate computations over the groups, not the groups themselves. Oh, dear, what to do, what to do?
The only way to solve this problem is to form a self join back to the data I want. If I take the result of the group-by and join it back to the original query (everything up until the group-by) matching group-by key values I get all the data back that SQL was not going to let me have. That’s what the ‘subqueryBasis’ is. It’s a rebinding of the original input sequence. Then we rebind the key again and do my ColumnProjector trick to get a new collection of group expressions. Now I can use both sets to form a predicate for a join.
Next, you can see that if I did specify an element selector I now bind it with respect to this new ‘basis’ and use the resulting expression as the projection coming out of this new branch of the query tree.
Alas, I am not yet done. What about that third argument? If it’s specified then I have another lambda I have to do something with and if it is not specified then I need to figure out how I’m going to return an IGrouping<K,E>.
Okay, that doesn’t turn out to be too difficult. I created a Grouping<K,E> class that I can use in case I need to actually return this stuff all the way back to the calling code.
public class Grouping<TKey, TElement> : IGrouping<TKey, TElement> {
TKey key;
IEnumerable<TElement> group;
public Grouping(TKey key, IEnumerable<TElement> group) {
this.key = key;
this.group = group;
}
public TKey Key {
get { return this.key; }
}
public IEnumerator<TElement> GetEnumerator() {
return this.group.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator() {
return this.group.GetEnumerator();
}
}
Even if I don’t, I can still use this type as a way to encode the fact that I’m returning a grouping, which is what I do. The final result of the call to GroupBy (unless a result selector was supplied) is a sequence of these groupings, so the projector expression resulting from this operation is simply a construction of one of these Grouping<K,E> types via a NewExpression node. If a result selector is supplied that lambda determines the shape of the result so I just bind that the same way I bind all the other lambdas. In both cases, I’m using the newly created extra ‘basis’ query that projects out elements to form the group collection that is either transformed with the result selector or returned to the caller.
That’s about it; clean and neat. Right? What about those other few lines of code referring to ‘GroupByInfo’ and ‘currentGroupElement’? That, sir, is the man behind the current. Now I’ll show you a little of him.
Later, when I’m binding aggregates I will need to invent a way to tie those aggregates back to this group-by; if they belong to this group-by. So, later, I’m going to need to know something about this group-by (and certainly all group-by’s that might have already been encountered) to be able to pick the right one. This is what the GroupByInfo is. It’s the bits of information I hope will be useful later to correlate the two pieces together. If we move along now to aggregate binding I’ll show you how it’s used.
Aggregate Binding
Aggregates get treated just like any other operator.
protected override Expression VisitMethodCall(MethodCallExpression m) {
if (m.Method.DeclaringType == typeof(Queryable) ||
m.Method.DeclaringType == typeof(Enumerable)) {
...
switch (m.Method.Name) {
case "Count":
case "Min":
case "Max":
case "Sum":
case "Average":
if (m.Arguments.Count == 1) {
return this.BindAggregate(m.Arguments[0], m.Method, null);
}
else if (m.Arguments.Count == 2) {
LambdaExpression selector = (LambdaExpression)StripQuotes(m.Arguments[1]);
return this.BindAggregate(m.Arguments[0], m.Method, selector);
}
break;
}
}
return base.VisitMethodCall(m);
}
The actual binding has a few extra bits that help make the job a bit easier.
Expression currentGroupElement;
class GroupByInfo {
internal string Alias { get; private set; }
internal Expression Element { get; private set; }
internal GroupByInfo(string alias, Expression element) {
this.Alias = alias;
this.Element = element;
}
}
private AggregateType GetAggregateType(string methodName) {
switch (methodName) {
case "Count": return Aggregate