Monday, May 21, 2018

Sharded Collections


Sharding allows you to seamlessly distribute rows of a table over multiple instances while keeping this implementation detail hidden from the client.

This can be an extremely powerful tool that allows you to distribute your workload over many smaller/cheaper instances, achieve greater parallelism for queries and scale up your environment as you grow.

When people think of sharding, they typically think of MongoDB and the sharding of JSON documents over multiple nodes. Informix can do that too, but it can also shard a relational table over multiple nodes, and that seems like a fun exercise. Let's try it!

What we will do
  • Create a second HCL Informix AWS instance
  • Configure both of our instances to allow sharding
  • Create a sharded collection for our relational stock_trans table
  • Create a third instance and add it to the sharded collection
  • Remove the second instance from our sharded collection
  • Do some performance testing along the way
Create a second HCL Informix AWS instance

Follow the steps from the last two blog posts (this and this) to set up a second AWS instance. You can skip some of those steps because we will reuse the Security Group and Key Pair we created earlier.

1. Create a new Elastic IP
2. Configure and launch a second HCL Informix EC2 instance using the Security Group and Key Pair we created earlier
3. Assign the Elastic IP to this new instance
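
If you prefer the command line, the AWS CLI can do the same Elastic IP steps; the instance id below is just a placeholder for whatever EC2 assigns to your new instance:

> aws ec2 allocate-address --domain vpc
> aws ec2 associate-address --instance-id <new instance id> --allocation-id <allocation id from the previous command>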

Configure both instances to allow sharding

1. ONCONFIG changes

We're going to follow steps 1 through 8 in the AWS instance changes section of the previous post to change the DBSERVERNAME from ol_aws to informix_1 and informix_2, and we're going to use the number in the DBSERVERNAME to map to a shard id (informix_1 is shard 1, informix_2 is shard 2, etc.).

We also need to make some additional ONCONFIG changes on both instances. Make the following changes to informix_1 and bounce the engine. For informix_2, you can make these changes when you change the other ONCONFIG parameters.

SHARD_ID <shard id>
CDR_SERIAL 20,<shard id>
CDR_AUTO_DISCOVER 1

The table we are going to shard has a BIGSERIAL field, and Informix sharding is built on Enterprise Replication, so we want to ensure that serial values are unique across all shards. That is where CDR_SERIAL comes into play. With CDR_SERIAL defined as 20,1 on shard 1 and 20,2 on shard 2, rows inserted on shard 1 will get serial values of 1, 21, 41, 61, 81,... and rows inserted on shard 2 will get 2, 22, 42, 62, 82,.... I'm limiting myself to a maximum of 20 shards with this configuration because I know I'm not going to create more than that.
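
For example, one possible way to apply these values (instead of hand-editing the ONCONFIG file) is onmode -wf, which writes the value into the ONCONFIG file; either way, bounce the engine afterwards so the settings take effect:

> onmode -wf SHARD_ID=1            # on informix_1 (shard 1)
> onmode -wf CDR_SERIAL="20,1"     # serials 1, 21, 41, 61, ...
> onmode -wf CDR_AUTO_DISCOVER=1

> onmode -wf SHARD_ID=2            # on informix_2 (shard 2)
> onmode -wf CDR_SERIAL="20,2"     # serials 2, 22, 42, 62, ...
> onmode -wf CDR_AUTO_DISCOVER=1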

CDR_AUTO_DISCOVER will allow us to use the cdr autoconfig server command, which simplifies setting up Enterprise Replication.

We can also add the informix_2_ssl information to our client's sqlhosts file and test the connection from our client when informix_2 comes online.

informix_1_ssl onsocssl <informix_1 elastic ip> 9089
informix_2_ssl onsocssl <informix_2 elastic ip> 9089
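
A quick way to test the new entry from the client once informix_2 comes online (assuming the same client user from the earlier posts; dbaccess will prompt for the password):

> dbaccess - -
connect to "sysmaster@informix_2_ssl" user "client";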

2. Security Group

We need to open up port 9088 for all of the servers in our private AWS network. First we need to figure out the CIDR for our private network. In the AWS console, click Services in the top menu, search for VPC and open the VPC Dashboard. Click on Your VPCs in the left-hand pane to bring up details for your VPCs. Find the value under IPv4 CIDR and remember it. It will be something like 172.31.0.0/16.

In the EC2 Dashboard, modify our Security Group again and add an inbound rule for TCP port 9088 with a Source of your IPv4 CIDR.
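
The AWS CLI equivalent looks something like this; the security group id is a placeholder and the CIDR is whatever you found above:

> aws ec2 describe-vpcs --query "Vpcs[].CidrBlock"
> aws ec2 authorize-security-group-ingress --group-id <security group id> --protocol tcp --port 9088 --cidr 172.31.0.0/16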

3. Create a new Storage Pool

For reasons unbeknownst to me, the cdr autoconfig server command just doesn't like the storage pool that comes with the engine. No problem, we can add a new storage pool to informix_1 and informix_2 with ease by running the following on both instances.

> mkdir /data/storagepool_2
> chmod 755 /data/storagepool_2
> dbaccess sysadmin

execute function task('STORAGEPOOL ADD','/data/storagepool_2',0,0,'64MB',1);
execute function task("modify config persistent", "SBSPACENAME", "sbspace1");
execute function task('CREATE SBSPACE FROM STORAGEPOOL', 'sbspace1', '100MB', 1);
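
To double check the new space on each instance, onstat -d should now show sbspace1 in the space list:

> onstat -d | grep sbspace1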

4. Trusted Connections

All of our shard servers need to connect to each other without supplying a password. We can do this by creating trusted hosts with the sysadmin:task function.

Run the following sysadmin tasks on informix_1 only.

execute function task("cdr add trustedhost", "<informix_1 internal DNS name>");
execute function task("cdr add trustedhost", "<informix_2 internal DNS name>");

5. Define the initial shard server with cdr autoconfig server

informix_1> cdr autoconfig server

This command is going to set up some ER dbspaces in our storage pool, automatically change our sqlhosts file to create an ER group and initialize our first ER server on informix_1.

6. Define the second shard server with cdr autoconfig server
informix_2> cdr autoconfig server -H <informix_1 internal DNS name> -P 9088

This command does some nice things for us.

Not only does it create our ER dbspaces like before, it also adds our new informix_2 server and group to all sqlhosts files in the ER network, propagates the trusted host information from informix_1 to all servers in the ER network and starts ER on informix_2.
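
At this point we can sanity check the ER domain from either instance with cdr list server; both shard servers should show up as Active and Connected:

> cdr list server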

Create a sharded collection

We're going to drop the old stocks database we created earlier and recreate the stocks database and stock_trans table on both servers.


database sysmaster;

drop database if exists stocks;

create database stocks in dbs_1 with buffered log;

create table stock_trans (
   id            bigserial,
   account_id    integer,
   stock_id      integer,
   action        char(1),
   shares        integer,
   price         float,
   timestamp     datetime year to second
) in dbs_1 extent size 20000 next size 20000 lock mode row;

create unique index stock_trans_pk on stock_trans(id);
alter table stock_trans add constraint primary key (id)
        constraint stock_trans_pk;

grant connect to client;

grant select, insert, update, delete on stock_trans to client;
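
One easy way to run this on both servers is to save the statements to a file (the file name here is just my choice) and feed it to dbaccess on each instance:

> dbaccess - create_stocks.sql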

The next step is to run the cdr define shardCollection command, but I think we need to pause for a second and talk about what's going to happen when we do, and about some steps our clients will need to take to take advantage of sharding.

Once the shard collection has been created for the stock_trans table and the client has enabled sharding, a row inserted into the stock_trans table will eventually end up physically living on informix_1 or informix_2, but the client doesn't know or care. The same goes for select, update and delete: the sharding is invisible to the client. To the client, the stock_trans table just looks like a regular table, with the exception that it can be accessed from either informix_1 or informix_2.

We will tell the shard collection to determine which shard to put a row on based on a hash of the id field, which will give us a pretty even distribution of data over all our shards.

cdr define shardCollection stock_trans_collection stocks:informix.stock_trans --type=delete --key=id --strategy=chash --partitions=3 --versionCol=id g_informix_1 g_informix_2

Now that the shard collection has been defined, the stock_trans tables on informix_1 and informix_2 are effectively joined into one table when the client sets the use_sharding session environment to on via 'set environment use_sharding on'.

After connecting to informix_1, setting 'use_sharding on' and inserting 10,000 records I can see that the collection is working.

set environment use_sharding on;

select 'informix_1', count(*) from stocks@informix_1:stock_trans; -- table only
select 'informix_2', count(*) from stocks@informix_2:stock_trans; -- table only
select 'collection', count(*) from stock_trans;                   -- collection

(constant)       (count(*))

informix_1             6742


(constant)       (count(*))

informix_2             3258


(constant)       (count(*))

collection            10000

select first 5 account_id, sum(shares) from stock_trans group by 1 order by 2 desc;

 account_id            (sum)

       5803             7376
       6637             6985
       5232             6714
       1432             6596
       2315             6485

The first two selects query the stock_trans table that is physically located on each instance, while the third and fourth selects query the collection. Behind the scenes both stock_trans tables are queried in parallel and the results are merged and sent back to the client.

Add a new shard to the collection

Over time our stock business has grown and 2 shards aren't enough for our workload, so we would like to add a third.

You can follow the steps we used to create informix_2 to create informix_3: run the "cdr add trustedhost" sysadmin:task on informix_1 to add informix_3 to the list of trusted hosts, run the same cdr autoconfig server command on informix_3 that we used to add informix_2, and don't forget to create the client user, stocks database and stock_trans table on informix_3.
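
Concretely, the ER part of that is a repeat of what we did for informix_2, with informix_3 substituted in:

-- in the sysadmin database on informix_1
execute function task("cdr add trustedhost", "<informix_3 internal DNS name>");

# and then on informix_3
informix_3> cdr autoconfig server -H <informix_1 internal DNS name> -P 9088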

When informix_3 has been added to the ER network, run the cdr change shardCollection command to add a new informix_3 shard and automatically rebalance the data over the 3 shards.

cdr change shardCollection stock_trans_collection --add g_informix_3

We can verify the collection rebalanced and is working properly:

set environment use_sharding on;

select 'informix_1', count(*) from stocks@informix_1:stock_trans; -- table only
select 'informix_2', count(*) from stocks@informix_2:stock_trans; -- table only
select 'informix_3', count(*) from stocks@informix_3:stock_trans; -- table only
select 'collection', count(*) from stock_trans;                   -- collection


(constant)       (count(*))

informix_1             2876


(constant)       (count(*))

informix_2             3258


(constant)       (count(*))

informix_3             3866


(constant)       (count(*))

collection            10000

Taking advantage of the sharded collection

In the last post we saw that going over the public internet slowed things down considerably, but we can use our new sharded setup to cut that time down.

When we just had one standalone table, it took 384.8 seconds to insert 10,000 stock_trans rows. What if we loaded 3 sets of data in parallel: 3,333 rows to informix_1, 3,333 rows to informix_2 and 3,334 rows to informix_3?

Splitting the data set and loading in parallel cuts the load time down to 163.4 seconds, over twice as fast as before.
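
For reference, here is a rough sketch of how a parallel load like that could be driven from the client; the file and script names, and an informix_3_ssl sqlhosts entry on the client, are my own assumptions:

> split -n l/3 stock_trans.unl part_        # part_aa, part_ab, part_ac, roughly a third each

-- load_1.sql (load_2.sql and load_3.sql point at the other pieces)
set environment use_sharding on;
load from "part_aa" insert into stock_trans(account_id, stock_id, action, shares, price, timestamp);

> dbaccess stocks@informix_1_ssl load_1.sql &
> dbaccess stocks@informix_2_ssl load_2.sql &
> dbaccess stocks@informix_3_ssl load_3.sql &
> wait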

Removing a shard from the collection

Let's say we no longer want informix_2 to be a shard. Hey, it happens. Removing it is easy.

cdr change shardCollection stock_trans_collection --drop g_informix_2

When we take a look under the covers, we see that the informix_2 stock_trans table has been emptied and the rows redistributed among informix_1 and informix_3.

set environment use_sharding on;

select 'informix_1', count(*) from stocks@informix_1:stock_trans; -- table only
select 'informix_2', count(*) from stocks@informix_2:stock_trans; -- table only
select 'informix_3', count(*) from stocks@informix_3:stock_trans; -- table only
select 'collection', count(*) from stock_trans;                   -- collection

(constant)       (count(*))

informix_1             6769


(constant)       (count(*))

informix_2                0


(constant)       (count(*))

informix_3            13231


(constant)       (count(*))

collection            20000


Well, that's enough fun for today. I hope you've enjoyed taking shards and Informix on AWS for a test drive.

I have a couple of other ideas for this technology. We definitely need to get some HDR Secondaries going. Not only because we like to have redundancy, but a secondary can also be used to satisfy selects and I'd like to see how that improves response times. I also want to see how Updateable Secondaries work with shard collections.

Don't forget to check out the IIUG World 2018 Informix event in Washington, DC October 28 to November 1, 2018.

Early Registration is now open (save $275), we are now accepting Speaker Proposals (save 100%), and we even have a new website from this century and a Twitter account you can follow (@IIUGWorld2018) for the latest updates.



