Friday, November 28, 2014

Ab Initio Components


There are several components in Ab Initio for building graphs. They are divided into two sets:
-          Dataset Components: components which hold data
-          Program Components: components which process data


Dataset Components
1.      Input File:

Input File represents records read as input to a graph from one or more serial files or from a multi file.



We can use multiple files (of the same type) as input.

Click the partition radio button and then click the Edit button. In the edit box, enter the name of the variable that points to the files; the variable has to be defined in a fnx file, for example

export INPUT_FILES=`ls -1 $AI_TEMP/MML/CCE*`

or in the sandbox, where the left column should contain the variable name (INPUT_FILES) and the right column its definition ($AI_TEMP/MML/CCE*).

INPUT_FILES then points to all the files under the $AI_TEMP/MML directory whose names start with CCE.

2.      Input Table:

Input Table unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.



3.      Output File:

Output File represents records written as output from a graph into one or more serial files or a multifile.
The output file can be created in write or append mode, and permissions for other users can be controlled.
When the target of an Output File component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and re-creates that file, nor does it ever truncate it.



4.      Output Table:

Output Table loads records from a graph into a database, letting you specify the destination either directly as a single database table, or through an SQL Statement that inserts records into one or more tables.



Program Components

1.      Sort components: The Sort component reorders data. You can use Sort to order records before you send them to a component that requires grouped or sorted records. It has two parameters:

Key (key_specifier, required): describes the collation order. It names the key field(s) and the sequence specifier(s) you want the component to use when it orders records.

Max-core (integer, required): controls how often the Sort component dumps data from memory to disk. Maximum memory usage is specified in bytes; the default is 100663296 (100 MB). When the component reaches the number of bytes specified in the max-core parameter, it sorts the records it has read so far and writes a temporary file to disk.
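To picture what max-core does, here is a small external-sort sketch in plain Python (this is not Ab Initio code; the record source, key field, and byte budget are invented for illustration). Records are buffered until the memory budget is reached, each full buffer is sorted and spilled to a temporary file, and the sorted runs are merged at the end:

import heapq
import os
import pickle
import tempfile

def external_sort(records, key, max_core_bytes):
    """Sort dict records by `key`, spilling sorted runs to disk whenever
    the in-memory buffer exceeds `max_core_bytes` (the idea behind the
    Sort component's max-core parameter)."""
    runs = []                 # paths of sorted temporary files ("spills")
    buffer, used = [], 0

    def spill():
        buffer.sort(key=lambda r: r[key])
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            for rec in buffer:
                pickle.dump(rec, f)
        runs.append(path)
        buffer.clear()

    for rec in records:
        buffer.append(rec)
        used += len(pickle.dumps(rec))    # rough memory accounting
        if used >= max_core_bytes:        # max-core reached: dump to disk
            spill()
            used = 0
    if buffer:
        spill()

    def read_run(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        os.remove(path)

    # merge the sorted runs back into one ordered stream
    yield from heapq.merge(*(read_run(p) for p in runs), key=lambda r: r[key])

data = [{"id": i % 7, "val": i} for i in range(20)]
for rec in external_sort(data, key="id", max_core_bytes=256):
    print(rec)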

2.      Reformat components:


Reformat changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. By default Reformat has one output port; the number of output ports can be increased by incrementing the count parameter, in which case a separate transform function has to be written for each output port.

If any selection from the input port is required, the select parameter can be used instead of placing a ‘Filter by Expression’ component before the Reformat. Reformat processes records as follows (see the sketch after the parameter list below):
1.       Reads a record from the input port.
2.       Passes the record as an argument to the transform function (xfr).
3.       Writes the record to the out port if the function returns a success status.
4.       Writes the record to the reject port if the function returns a failure status.

Parameters of Reformat Component
       Count
       Transform (Xfr) Function
       Reject-Threshold
     -          Abort
     -          Never Abort
     -          Use Limit & Ramp
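The record flow described above can be pictured with a short Python sketch (this is not DML; the transform and the field names are hypothetical):

def reformat(records, xfr):
    """Apply a transform to each record: successful results go to `out`,
    failures (exceptions here) go to `reject`, mirroring Reformat's
    out and reject ports."""
    out, reject = [], []
    for rec in records:
        try:
            out.append(xfr(rec))       # success status -> out port
        except Exception:
            reject.append(rec)         # failure status -> reject port
    return out, reject

# hypothetical transform: combine two fields and derive a third
def xfr(rec):
    return {"name": rec["first"] + " " + rec["last"],
            "amount_usd": rec["amount_cents"] / 100.0}

records = [{"first": "Ada", "last": "Lovelace", "amount_cents": 1250},
           {"first": "Alan", "last": "Turing"}]   # missing field -> reject
out, reject = reformat(records, xfr)
print(out)      # the transformed record
print(reject)   # the record that failed the transform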

3.      Join Components:

Join reads records from multiple input ports, operates on records with matching keys using a multi-input transform function, and writes the results to its output port.

Join deals with two activities:
1. Transforming data sources with different record formats.

2. Combining data sources with the same record format.

Join Types
Inner Join: 

It uses only records with matching keys on both inputs. This type of join is used to produce an output record only if there is a record in input 0 with a key that matches a record in input 1.



Full Outer Join:

It uses all records from both inputs. If a record from one does not have a matching record in the other input, a NULL record is used for the missing record. This type of join is used to produce output for all of the records that are present in either input 0 or input 1.



Explicit Join (Semi Join):

It uses all records in one specified input, but records with matching keys in the other inputs are optional. Again a NULL record is used for the missing records.

Case 1 (In0 required): In this type of join, a record is required on input 0, but the presence of a record on input 1 with the same key is optional. Two key combinations can occur here, so it may again be necessary to prioritize rules in the transform.
Case 2 (In1 required): In this type of join, a record is required on input 1, but the presence of a record on input 0 with the same key is optional.
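The join types can be illustrated with a small Python sketch (conceptual only, not Ab Initio code; the key and field names are invented, and unique keys per input are assumed for brevity). A missing side is represented by None, playing the role of the NULL record:

def join(in0, in1, key, join_type="inner"):
    k0 = {r[key]: r for r in in0}       # assumes unique keys per input
    k1 = {r[key]: r for r in in1}
    if join_type == "inner":
        keys = k0.keys() & k1.keys()    # matching keys on both inputs
    elif join_type == "full_outer":
        keys = k0.keys() | k1.keys()    # every key from either input
    elif join_type == "in0_required":   # explicit/semi join, In0 required
        keys = k0.keys()
    elif join_type == "in1_required":   # explicit/semi join, In1 required
        keys = k1.keys()
    else:
        raise ValueError(join_type)
    return [(k0.get(k), k1.get(k)) for k in sorted(keys)]

customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
orders    = [{"id": 2, "total": 10}, {"id": 3, "total": 20}]
print(join(customers, orders, "id", "inner"))       # only id 2
print(join(customers, orders, "id", "full_outer"))  # ids 1, 2 and 3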

Join Methods:
There are two join methods:
      Merge Join: Using sorted inputs
       Hash Join: Using in-memory hash tables to group input

A merge join is performed by sorting the two data sets to be joined according to the join keys and then merging them together. The merge is very cheap, but the sort can be prohibitively expensive especially if the sort spills to disk. The cost of the sort can be lowered if one of the data sets can be accessed in sorted order via an index, although accessing a high proportion of blocks of a table via an index scan can also be very expensive in comparison to a full table scan.
A hash join is performed by hashing one data set into memory based on the join columns and reading the other one, probing the hash table for matches. The hash join is very low cost when the hash table can be held entirely in memory, with the total cost amounting to very little more than the cost of reading the data sets. The cost rises if the hash table has to be spilled to disk (a one-pass operation), and rises considerably for a multipass operation.
The cost of a hash join can be reduced by partitioning both tables on the join key(s). This allows the optimiser to infer that rows from a partition in one table will only find a match in a particular partition of the other table, and for tables having n partitions the hash join is executed as n independent hash joins. This has the following effects:


  • The size of each hash table is reduced, hence reducing the maximum amount of memory required and potentially removing the need for the operation to require temporary disk space.

  • For parallel query operations the amount of inter-process messaging is vastly reduced, reducing CPU usage and improving performance, as each hash join can be performed by one pair of PQ processes.

  • For non-parallel query operations the memory requirement is reduced by a factor of n, and the first rows are projected from the query earlier.

You should note that hash joins can only be used for equi-joins, but merge joins are more flexible.
In general, if you are joining large amounts of data in an equi-join then a hash join is going to be a better bet.
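The build/probe pattern of a hash equi-join can be sketched in a few lines of Python (illustrative only; the tables and field names are made up). The smaller input is hashed into memory, the larger one is streamed past it:

from collections import defaultdict

def hash_join(build_side, probe_side, key):
    table = defaultdict(list)
    for rec in build_side:                    # build phase: load hash table
        table[rec[key]].append(rec)
    for rec in probe_side:                    # probe phase: stream and match
        for match in table.get(rec[key], []):
            yield {**match, **rec}

departments = [{"dept": 1, "dept_name": "HR"}, {"dept": 2, "dept_name": "Eng"}]
employees   = [{"dept": i % 3, "emp": i} for i in range(10)]
for row in hash_join(departments, employees, "dept"):
    print(row)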

4.      Filter by Expression components:

Filter by Expression filters records according to a specified DML expression.
It can be compared to the WHERE clause of a SQL SELECT statement.
Different DML functions can be used in the select expression of the Filter by Expression component; even a lookup can be used.
- Reads data records from the in port.
- Applies the expression in the select_expr parameter to each record. If the expression returns:

  • Non-0 value — Filter by Expression writes the record to the out port.
  • 0 — Filter by Expression writes the record to the deselect port. If you do not connect a flow to the deselect port, Filter by Expression discards the records.
  • NULL — Filter by Expression writes the record to the reject port and a descriptive error message to the error port.
Filter by Expression stops execution of the graph when the number of reject events exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
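A rough Python sketch of these routing rules and the limit/ramp abort check (this is not DML; the select expression and the limit and ramp values are hypothetical):

def filter_by_expression(records, select_expr, limit=0, ramp=0.0):
    out, deselect, reject = [], [], []
    processed = 0
    for rec in records:
        processed += 1
        try:
            result = select_expr(rec)
        except Exception:
            result = None                     # treat evaluation errors as NULL
        if result is None:
            reject.append(rec)                # NULL -> reject port
            if len(reject) > limit + ramp * processed:
                raise RuntimeError("reject threshold exceeded: graph aborts")
        elif result:                          # non-0 value -> out port
            out.append(rec)
        else:                                 # 0 -> deselect port
            deselect.append(rec)
    return out, deselect, reject

recs = [{"amount": a} for a in (5, 0, None, 12)]
out, deselect, reject = filter_by_expression(
    recs,
    lambda r: None if r["amount"] is None else r["amount"] > 3,
    limit=2, ramp=0.0)
print(out, deselect, reject)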


5.      Normalize components:
Normalize generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation (see the sketch after the steps below).
1. Reads an input record.

  • If you have not defined input_select, Normalize processes all records.
  • If you have defined input_select, only the input records for which it returns a non-zero value are processed.
2. Performs temporary-record initialization.
3. Performs iterations of the normalize transform function for the input record.
4. Sends the output records to the out port.
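Conceptually, Normalize looks like this in plain Python (not Ab Initio code; the vector field and field names are hypothetical):

def normalize(records, length, transform):
    # emit length(rec) output records per input record,
    # one per iteration of the normalize transform
    for rec in records:
        for i in range(length(rec)):
            yield transform(rec, i)

households = [{"household": "H1", "people": ["Ann", "Bob"]},
              {"household": "H2", "people": ["Cho"]}]
out = normalize(households,
                length=lambda r: len(r["people"]),
                transform=lambda r, i: {"household": r["household"],
                                        "person": r["people"][i]})
print(list(out))   # one output record per person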

6.      Denormalize components:
Denormalize Sorted consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. Denormalize Sorted requires grouped input.
For example, if you have a record for each person that includes the households to which that person belongs, Denormalize Sorted can consolidate those records into a record for each household that contains a variable number of people.
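The reverse operation can be sketched the same way (plain Python, grouped input assumed; the field names are hypothetical):

from itertools import groupby

def denormalize_sorted(records, key):
    # consolidate groups of records sharing `key` into one output record
    # holding a vector field (requires input grouped by key)
    for k, group in groupby(records, key=lambda r: r[key]):
        members = [r["person"] for r in group]
        yield {key: k, "people": members, "count": len(members)}

people = [{"household": "H1", "person": "Ann"},
          {"household": "H1", "person": "Bob"},
          {"household": "H2", "person": "Cho"}]
print(list(denormalize_sorted(people, "household")))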

7.      Multistage components
Multistage components are transform components in which records are processed in five stages: input selection, temporary record initialization, processing, finalization, and output selection.
Examples of multistage components are Aggregate, Rollup, and Scan (Rollup and Scan are sketched below).

  • Rollup: Rollup evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.
  • Aggregate: Aggregate generates records that summarize groups of records. In general, use ROLLUP for new development rather than Aggregate. Rollup gives you more control over record selection, grouping, and aggregation. However, use Aggregate when you want to return the single record that has a field containing either the maximum or the minimum value of all the records in the group.
  • Scan: For every input record, Scan generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records. Scan can be used in continuous graphs.
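The difference between Rollup and Scan shows up clearly in a short Python sketch (conceptual only; grouped input and the field names are assumptions):

from itertools import groupby

def rollup(records, key, field):
    # one summary record per key group
    for k, group in groupby(records, key=lambda r: r[key]):
        yield {key: k, "total": sum(r[field] for r in group)}

def scan(records, key, field):
    # one output record per input record, carrying a running total
    for k, group in groupby(records, key=lambda r: r[key]):
        running = 0
        for r in group:
            running += r[field]
            yield {**r, "running_total": running}

sales = [{"year": 2014, "amount": 10}, {"year": 2014, "amount": 5},
         {"year": 2015, "amount": 7}]
print(list(rollup(sales, "year", "amount")))  # one record per year
print(list(scan(sales, "year", "amount")))    # year-to-date totals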


8.      Partition components:
Partition components are used to divide data sets into multiple sets for further processing.
Several partition components are available:

  • Partition by Round-robin
  • Partition by Key
  • Partition by Expression
  • Partition by Range
  • Partition by Percentage
  • Partition by Load Balance

Partition by Round-robin

  • It reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in “round robin” fashion, with ‘block-size’ records going to a partition before moving on to the next (see the sketch below).
  • Not key-based.
  • Balancing is good.
  • Record-independent parallelism.
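A minimal Python sketch of round-robin dealing with a block size (illustrative only; the partition count and block size are arbitrary):

def partition_round_robin(records, n_partitions, block_size=1):
    # deal block_size records to each partition in turn
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions

print(partition_round_robin(list(range(10)), n_partitions=3, block_size=2))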

Partition by Key

  • Partition by Key reads records from the in port and distributes them to its output flow partitions according to key values (see the sketch below).
  • Reads records in arbitrary order from the in port.
  • Distributes records to the flows connected to the out port, according to the Key parameter, writing records with the same key value to the same output flow.
  • Partition by Key is typically followed by Sort.
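A minimal Python sketch of key-based partitioning (illustrative only; any deterministic hash of the key would do, and Python's built-in hash() is used here just for brevity):

def partition_by_key(records, key, n_partitions):
    # records with the same key value always land in the same partition
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        partitions[hash(rec[key]) % n_partitions].append(rec)
    return partitions

recs = [{"cust": c, "amt": i} for i, c in enumerate("ABABCA")]
print(partition_by_key(recs, "cust", n_partitions=2))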

Partition by Expression

  • Partition by Expression distributes records to its output flow partitions according to a specified DML expression.

Partition by Range

  • Partition by Range distributes records to its output flow partitions according to the ranges of key values specified for each partition. Partition by Range distributes the records relatively equally among the partitions.
  • Use Partition by Range when you want to divide data into useful, approximately equal, groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted.
  • The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
  • Key based.
  • Balancing depends on the splitters (see the sketch below).
  • Key-dependent parallelism and global ordering.
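A minimal Python sketch of range partitioning driven by splitter values (illustrative only; the splitters and key values are made up):

from bisect import bisect_left

def partition_by_range(records, key, splitters):
    # n splitters give n + 1 partitions, preserving global key order
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        partitions[bisect_left(splitters, rec[key])].append(rec)
    return partitions

recs = [{"id": i} for i in (3, 15, 8, 22, 1)]
print(partition_by_range(recs, "id", splitters=[5, 20]))
# ids up to 5 -> partition 0, ids 6..20 -> partition 1, ids above 20 -> partition 2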


Partition by Percentage

  • Partition by Percentage distributes a specified percentage of the total number of input records to each output flow.

Partition by Load Balance

  • Partition with Load Balance distributes data records to its output flow partitions, writing more records to the flow partitions that consume records faster. This component is not frequently used.
  • Not key-based.
  • Balancing depends on load.
  • Record independent parallelism

9.      De-partition components:
Departition components read data from multiple flows and are used to recombine data records from different flows. Departitioning combines many flows of data to produce one flow; it is the opposite of partitioning. Each departition component combines flows in a different manner.
Several departition components are available:

  • Gather
  • Concatenate
  • Merge
  • Interleave

Gather

  • Reads data records from the flows connected to the input port.
  • Combines the records arbitrarily and writes them to the output port.
  • Not key-based.
  • Result ordering is unpredictable.
  • Has no effect on the upstream processing.
  • Most useful method for efficient collection of data from multiple partitions and for repartitioning.
  • Used most frequently.

Concatenate

  • Concatenate appends multiple flow partitions of data records one after another.
  • Not key-based.
  • Result ordering is by partition.
  • Serializes pipelined computation.
  • Useful for appending headers and trailers, and for creating a serial flow from partitioned data.
  • Used very infrequently.

Merge

  • Key-based.
  • Result ordering is sorted if each input is sorted.
  • Possibly synchronizes pipelined computation.
  • May even serialize.
  • Useful for creating ordered data flows.
  • Other than Gather, Merge is the usual ‘departitioner’ of choice (see the sketch below).
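Merge corresponds to a k-way merge of already-sorted streams; a minimal Python illustration (the partition contents here are made up):

import heapq

# each partition is already sorted on the key, as Merge requires
part0 = [{"id": 1}, {"id": 4}, {"id": 7}]
part1 = [{"id": 2}, {"id": 5}]
part2 = [{"id": 3}, {"id": 6}]

merged = heapq.merge(part0, part1, part2, key=lambda r: r["id"])
print(list(merged))   # a single flow, still sorted on id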

Interleave:

  • Interleave combines blocks of records from multiple flow partitions in round-robin fashion.
  • You can use Interleave to undo the effects of Partition by Round-robin.