Aggregation Framework Examples#

This document provides a number of practical examples that display the capabilities of the aggregation framework.

The Aggregations using the Zip Codes Data Set examples uses a publicly available data set of all zipcodes and populations in the United States. These data are available at: zips.json.

Requirements#

Let’s check if everything is installed.

Use the following command to load zips.json data set into mongod instance:

$ mongoimport --drop -d test -c zipcodes zips.json

Let’s use the MongoDB shell to verify that everything was imported successfully.

$ mongo test
connecting to: test
> db.zipcodes.count()
29467
> db.zipcodes.findOne()
{
      "_id" : "35004",
      "city" : "ACMAR",
      "loc" : [
              -86.51557,
              33.584132
      ],
      "pop" : 6055,
      "state" : "AL"
}

Aggregations using the Zip Codes Data Set#

Each document in this collection has the following form:

{
  "_id" : "35004",
  "city" : "Acmar",
  "state" : "AL",
  "pop" : 6055,
  "loc" : [-86.51557, 33.584132]
}

In these documents:

  • The _id field holds the zipcode as a string.

  • The city field holds the city name.

  • The state field holds the two letter state abbreviation.

  • The pop field holds the population.

  • The loc field holds the location as a [latitude, longitude] array.

States with Populations Over 10 Million#

To get all states with a population greater than 10 million, use the following aggregation pipeline:

aggregation1.c#
#include <mongoc/mongoc.h>
#include <stdio.h>

static void
print_pipeline (mongoc_collection_t *collection)
{
   mongoc_cursor_t *cursor;
   bson_error_t error;
   const bson_t *doc;
   bson_t *pipeline;
   char *str;

   pipeline = BCON_NEW ("pipeline",
                        "[",
                        "{",
                        "$group",
                        "{",
                        "_id",
                        "$state",
                        "total_pop",
                        "{",
                        "$sum",
                        "$pop",
                        "}",
                        "}",
                        "}",
                        "{",
                        "$match",
                        "{",
                        "total_pop",
                        "{",
                        "$gte",
                        BCON_INT32 (10000000),
                        "}",
                        "}",
                        "}",
                        "]");

   cursor = mongoc_collection_aggregate (collection, MONGOC_QUERY_NONE, pipeline, NULL, NULL);

   while (mongoc_cursor_next (cursor, &doc)) {
      str = bson_as_canonical_extended_json (doc, NULL);
      printf ("%s\n", str);
      bson_free (str);
   }

   if (mongoc_cursor_error (cursor, &error)) {
      fprintf (stderr, "Cursor Failure: %s\n", error.message);
   }

   mongoc_cursor_destroy (cursor);
   bson_destroy (pipeline);
}

int
main (void)
{
   mongoc_client_t *client;
   mongoc_collection_t *collection;
   const char *uri_string = "mongodb://localhost:27017/?appname=aggregation-example";
   mongoc_uri_t *uri;
   bson_error_t error;

   mongoc_init ();

   uri = mongoc_uri_new_with_error (uri_string, &error);
   if (!uri) {
      fprintf (stderr,
               "failed to parse URI: %s\n"
               "error message:       %s\n",
               uri_string,
               error.message);
      return EXIT_FAILURE;
   }

   client = mongoc_client_new_from_uri (uri);
   if (!client) {
      return EXIT_FAILURE;
   }

   mongoc_client_set_error_api (client, 2);
   collection = mongoc_client_get_collection (client, "test", "zipcodes");

   print_pipeline (collection);

   mongoc_uri_destroy (uri);
   mongoc_collection_destroy (collection);
   mongoc_client_destroy (client);

   mongoc_cleanup ();

   return EXIT_SUCCESS;
}

You should see a result like the following:

{ "_id" : "PA", "total_pop" : 11881643 }
{ "_id" : "OH", "total_pop" : 10847115 }
{ "_id" : "NY", "total_pop" : 17990455 }
{ "_id" : "FL", "total_pop" : 12937284 }
{ "_id" : "TX", "total_pop" : 16986510 }
{ "_id" : "IL", "total_pop" : 11430472 }
{ "_id" : "CA", "total_pop" : 29760021 }

The above aggregation pipeline is build from two pipeline operators: $group and $match.

The $group pipeline operator requires _id field where we specify grouping; remaining fields specify how to generate composite value and must use one of the group aggregation functions: $addToSet, $first, $last, $max, $min, $avg, $push, $sum. The $match pipeline operator syntax is the same as the read operation query syntax.

The $group process reads all documents and for each state it creates a separate document, for example:

{ "_id" : "WA", "total_pop" : 4866692 }

The total_pop field uses the $sum aggregation function to sum the values of all pop fields in the source documents.

Documents created by $group are piped to the $match pipeline operator. It returns the documents with the value of total_pop field greater than or equal to 10 million.

Average City Population by State#

To get the first three states with the greatest average population per city, use the following aggregation:

pipeline = BCON_NEW ("pipeline", "[",
   "{", "$group", "{", "_id", "{", "state", "$state", "city", "$city", "}", "pop", "{", "$sum", "$pop", "}", "}", "}",
   "{", "$group", "{", "_id", "$_id.state", "avg_city_pop", "{", "$avg", "$pop", "}", "}", "}",
   "{", "$sort", "{", "avg_city_pop", BCON_INT32 (-1), "}", "}",
   "{", "$limit", BCON_INT32 (3) "}",
"]");

This aggregate pipeline produces:

{ "_id" : "DC", "avg_city_pop" : 303450.0 }
{ "_id" : "FL", "avg_city_pop" : 27942.29805615551 }
{ "_id" : "CA", "avg_city_pop" : 27735.341099720412 }

The above aggregation pipeline is build from three pipeline operators: $group, $sort and $limit.

The first $group operator creates the following documents:

{ "_id" : { "state" : "WY", "city" : "Smoot" }, "pop" : 414 }

Note, that the $group operator can’t use nested documents except the _id field.

The second $group uses these documents to create the following documents:

{ "_id" : "FL", "avg_city_pop" : 27942.29805615551 }

These documents are sorted by the avg_city_pop field in descending order. Finally, the $limit pipeline operator returns the first 3 documents from the sorted set.