What are mappers and reducers

Customize the mapper

When Hive starts a Hadoop job, the job is processed by one or more mapper tasks. Provided that your DynamoDB table has enough throughput capacity, you can change the number of mappers in the cluster, potentially improving performance.

The number of mapper tasks used in a Hadoop job is determined by input splits, where Hadoop divides the data into logical blocks. If Hadoop does not perform enough input splits, your write operations might not be able to use all of the write throughput capacity of the DynamoDB table.

Increase the number of mappers

Each mapper in an Amazon EMR cluster has a maximum read rate of 1 MiB per second. The number of mappers in a cluster depends on the size of the nodes in your cluster. (For more information about node sizes and the number of mappers per node, see Task configuration in the Amazon EMR Developer Guide.)

If your DynamoDB table has enough throughput capacity for reads, you can try increasing the number of mappers by doing one of the following:

  • Increase the size of the nodes in your cluster. For example, if your cluster uses m1.large nodes (three mappers per node), you can upgrade to m1.xlarge nodes (eight mappers per node).

  • Increase the number of nodes in your cluster. For example, if you have a three-node cluster of m1.xlarge nodes, you have a total of 24 mappers available. If you double the size of the cluster with the same type of node, you get 48 mappers.
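
The arithmetic behind both options can be sketched as follows. This is a minimal illustration, assuming the per-node mapper counts quoted above (three for m1.large, eight for m1.xlarge) and the maximum read rate of 1 MiB per second per mapper. The names MAPPERS_PER_NODE, total_mappers, and max_read_rate_mib are hypothetical and are not part of any AWS or Hadoop API.

```python
# Mappers per node, as quoted in the text above (not an exhaustive table).
MAPPERS_PER_NODE = {"m1.large": 3, "m1.xlarge": 8}

# Maximum read rate of a single mapper, in MiB per second.
MAX_READ_MIB_PER_MAPPER = 1


def total_mappers(node_type: str, node_count: int) -> int:
    """Return the number of mappers in a cluster of identical nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return MAPPERS_PER_NODE[node_type] * node_count


def max_read_rate_mib(node_type: str, node_count: int) -> int:
    """Return an upper bound on the aggregate read rate, in MiB per second."""
    return total_mappers(node_type, node_count) * MAX_READ_MIB_PER_MAPPER


if __name__ == "__main__":
    # Compare the three-node cluster with a cluster twice its size.
    for nodes in (3, 6):
        print(nodes, "x m1.xlarge:", total_mappers("m1.xlarge", nodes), "mappers")
```

Under these assumptions, three m1.xlarge nodes give 24 mappers and six give 48, so the aggregate read rate is bounded by 24 MiB and 48 MiB per second respectively, provided that the table has enough read capacity to match.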

You can use the AWS Management Console to manage the size and number of nodes in your cluster. (You may have to restart the cluster for these changes to take effect.)

Another way to increase the number of mappers is to raise the corresponding Hadoop configuration parameter. (This is a Hadoop parameter, not a Hive parameter. You cannot change it interactively at the command prompt.) If you increase the value of this parameter, you can increase the number of mappers without changing the size or number of nodes. However, the cluster nodes can run out of memory if you set the value too high.

You set the value for this parameter as a bootstrap action when you first launch your Amazon EMR cluster. For more information, see Creating Bootstrap Actions to Install Additional Software (Optional) in the Amazon EMR documentation.

Reduce the number of mappers

If you use the SELECT statement to select data from an external Hive table that maps to DynamoDB, the Hadoop job can use as many tasks as needed, up to the maximum number of mappers in the cluster. In this scenario, a long-running Hive query can consume all of the provisioned read capacity of the DynamoDB table, which has a negative impact on other users.

You can set a parameter to place an upper limit on the number of map tasks.

This value must be equal to or greater than 1. When Hive processes your query, the corresponding Hadoop job uses no more than this number of map tasks when reading from the DynamoDB table.
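
The statement that sets this limit does not appear in the text above. As a rough sketch only, the helper below validates the value against the rule just stated and produces a Hive SET statement. The parameter name dynamodb.max.map.tasks is an assumption that the text does not confirm, so check it against the Amazon EMR documentation for your release. The helper name build_map_task_limit is hypothetical.

```python
# ASSUMPTION: the parameter name is not given in the text above;
# "dynamodb.max.map.tasks" is used here for illustration only.
ASSUMED_PARAMETER = "dynamodb.max.map.tasks"


def build_map_task_limit(max_map_tasks: int,
                         parameter: str = ASSUMED_PARAMETER) -> str:
    """Return a Hive SET statement that caps the number of map tasks."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_map_tasks, bool) or not isinstance(max_map_tasks, int):
        raise TypeError("max_map_tasks must be an integer")
    # The text requires a value equal to or greater than 1.
    if max_map_tasks < 1:
        raise ValueError("max_map_tasks must be equal to or greater than 1")
    return f"SET {parameter}={max_map_tasks};"


if __name__ == "__main__":
    print(build_map_task_limit(4))
```

Running the resulting statement in the Hive session before the SELECT query should apply the limit to the queries in that session. A lower value leaves more read capacity for other users at the cost of a slower query.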