mcsimport is a high-speed bulk load utility that imports data into ColumnStore tables in a fast and efficient manner utilizing ColumnStore's Bulk Write SDK. Unlike cpimport, mcsimport was designed to be executed from a remote machine that doesn't necessarily needs to be a UM or PM. mcsimport is further executable from Windows and Linux operating systems.
Similar to cpimport, mcsimport accepts as input any flat file that contains a delimiter between fields of data (i.e. columns in a table). The default delimiter is a comma (‘,’), but other delimiters such as pipes may also be used. By default mcsimport expects the data values to be in the same order as the create table statement, and a date format of ‘YYYY-MM-DD HH:MM:SS’. But, these settings can be overwritten in a mapping file which allows customizeable input column to ColumnStore column mappings, the usage of individual input column specific date formats utilizing the strptime format, and the specification of default values for non mapped target columns.
It is important to note that:
There are three primary steps to using the mcsimport utility:
On Linux systems mcsimport requires the installation of the ColumnStore Bulk Write SDK, on Windows systems the Bulk Write SDK is bundled with mcsimport and doesn't require an extra installation.
mcsimport can also be installed from our MariaDB ColumnStore Tools repository. Detailed information can be found here.
First, install the Bulk Write SDK and dependencies according to following documentation.
Afterwards, you can install mcsimport via:
sudo rpm -ivh mariadb-columnstore-tools*.rpm
First, install the Bulk Write SDK and dependencies according to following documentation.
Afterwards, you can install mcsimport via:
sudo dpkg -i mariadb-columnstore-tools*.deb
First, install the Bulk Write SDK and dependencies according to following documentation.
Afterwards, you can install mcsimport via:
sudo dpkg -i mariadb-columnstore-tools*.deb
To install mcsimport on Windows 10 you simply have to follow the installation wizard of the installer.
http://downloads.mariadb.com/ColumnStore-Tools/latest/winx64-packages/
As mcsimport is using the Bulk Write SDK for the injection, all ports required by the ColumnStore Bulk write SDK need to be accessible from the client executing mcsimport at the target ColumnStore server. These are in particular the TCP ports 8616, 8630, and 8800.
mcsimport database table input_file [-m mapping_file] [-c Columnstore.xml] [-d delimiter] [-n null_option] [-df date_format] [-default_non_mapped] [-E enclose_by_character] [-C escape_character] [-rc read_cache_size] [-header] [-ignore_malformed_csv] [-err_log]
The mapping file is used to define the mapping between source csv columns and target ColumnStore columns, to define column specific input date formats, and to set default values for ignored target columns. It follows the Yaml 1.2 standard and can address the source csv columns implicit and explicit.
Source csv columns can only be identified by their position in the csv file starting with 0, and target ColumnStore columns can be identified either by their position or name.
Following snippet is an example for an implicit mapping file.
- column: target: 0 - column: - ignore - column: target: id - column: target: occurred format: "%d %b %Y %H:%M:%S" - target: 2 value: default - target: salary value: 20000
It defines that the first csv column (#0) is mapped to the first column in the ColumnStore table, that the second csv column (#1) is ignored and won't be injected into the target table, that the third csv column (#2) is mapped to the ColumnStore column with the name id, and that the fourth csv column (#3) is mapped to the ColumnStore column with the name occurred and uses a specific date format. (defined using the strptime format) The mapping file further defines that for the third ColumnStore target column (#2) its default value will be used, and that the ColumnStore target column with the name salary will be set to 20000 for all injections.
Explicit mapping is also possible.
- column: 0 target: id - column: 4 target: salary - target: timestamp value: 2018-09-13 12:00:00
Using this variant the first (#0) csv source column is mapped to the target ColumnStore column with the name id, and the fifth source csv column (#4) is mapped to the target ColumnStore column with the name salary. It further defines that the target ColumnStore column timestamp uses a default value of 2018-09-13 12:00:00 for the injection.
As mcsimport is built upon ColumnStore's Bulk Write SDK it inherits its methods to connect to ColumnStore instances to ingest data. By default mcsimport uses the standard configuration file /usr/local/mariadb/ColumnStore/etc/Columnstore.xml or if set the one defined through the environment variable COLUMNSTORE_INSTALL_DIR to connect to the remote Columnstore instance. Individual configurations can be defined through the command line parameter -c. Instructions on how to prepare Columnstore.xml for remote ingestion can be found here.
The default delimiter of the CSV input file is a comma (‘,’) and can be changed through the command line parameter -d. Only one character delimiters are currently supported.
By default mcsimport uses YYYY-MM-DD HH:MM:SS as input date format. An individual global date format can be specified via the command line parameter -df using the strptime format. Column specific input date formats can be defined in the mapping file and overwrite the global date format.
By default mcsimport treats input strings with the value "NULL" as data. If the null_option is set to 1 strings with the value "NULL" are treated as NULL values.
mcsimport needs to inject values for all ColumnStore columns of the target table. In order to use the ColumnStore column's default values for all non mapped target columns the global parameter default_non_mapped can be used. Target column specific default values in the mapping file overwrite the global default values of this parameter.
By default mcsimport uses the double-quote character " as enclosing character. It can be changed through the command line parameter -E. The enclosing character's length is limited to 1.
By default mcsimport uses the double-quote character " as escaping character. It can be changed through the command line parameter -C. The escaping character's length is limited to 1.
By default mcsimport uses a read cache size of 20,971,520 (20 MiB) to cache chunks of the input file in RAM. It can be changed through the command line paramter -rc. A minimum cache size of 1,048,576 (1 MiB) is required.
Choose this flag to ignore the first line of the input CSV file as header. (It won't be injected)
By default mcsimport rolls back the entire bulk import if a malformed csv entry is found. With this option mcsimport ignores detected malformed csv entries and continiues with the injection.
With this option an optional error log file is written which states truncated, saturated, and invalid values during the injection. If the command line parameter -ignore_malformed_csv is chosen, it also states which lines were ignored.
© 2019 MariaDB
Licensed under the Creative Commons Attribution 3.0 Unported License and the GNU Free Documentation License.
https://mariadb.com/kb/en/columnstore-remote-bulk-data-import-mcsimport/