There are several components in Ab Initio for building graphs. They are divided into two sets:
- Dataset Components: components which hold data
- Program Components: components which process data
Dataset Components
1. Input file:
Input File represents records read as input to a graph from one or more serial files or from a multifile.
We can use multiple files (of the same type) as input. Click the Partition radio button and then click the Edit button. In the edit box, mention the variable name that points to the files; the variable has to be defined in a fnx file, for example
export INPUT_FILES=`ls -1 $AI_TEMP/MML/CCE*`
or in the sandbox, where the left column should have the variable name (INPUT_FILES) and the right column the definition ($AI_TEMP/MML/CCE*).
This INPUT_FILES variable points to all the files under the $AI_TEMP/MML directory whose names start with CCE.
2. Input table:
Input Table unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.
3. Output file:
Output File represents records written as output from a graph into one or more serial files or a multifile.
The output file can be created in write or append mode, and permissions for other users can be controlled.
When the target of an Output File component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and re-creates that file, nor does it ever truncate it.
4. Output table:
Output Table loads records from a graph into a database, letting you specify the destination either directly as a single database table or through an SQL statement that inserts records into one or more tables.
Program Components
1. Sort component:
The Sort component reorders data. You can use Sort to order records before you send them to a component that requires grouped or sorted records. It has two parameters:
Key (key_specifier, required): describes the collation order, i.e. the name(s) of the key field(s) and the sequence specifier(s) you want the component to use when it orders records.
Max-core (integer, required): controls how often the Sort component dumps data from memory to disk. Maximum memory usage is in bytes; the default is 100663296 (100 MB). When the component reaches the number of bytes specified in the max-core parameter, it sorts the records it has read so far and writes a temporary file to disk.
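As a rough illustration of the max-core behaviour, here is a small Python sketch (not Ab Initio DML; the record layout and the run size are invented for the example). Records are buffered until a limit is reached, each full buffer is sorted into a run (standing in for a temporary file on disk), and the sorted runs are then merged:
import heapq

def maxcore_sort(records, key_field, max_core_records=3):
    runs = []                  # each run stands in for a temporary file on disk
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_core_records:          # "max-core" limit reached
            runs.append(sorted(buffer, key=lambda r: r[key_field]))
            buffer = []
    if buffer:
        runs.append(sorted(buffer, key=lambda r: r[key_field]))
    # Merging the sorted runs is cheap compared with the sorting itself.
    return list(heapq.merge(*runs, key=lambda r: r[key_field]))

print(maxcore_sort([{"id": 5}, {"id": 1}, {"id": 4}, {"id": 2}, {"id": 3}], "id"))
# [{'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}, {'id': 5}]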
2. Reformat component:
Reformat changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records. By default Reformat has one output port, but the count parameter can be incremented to add more ports; a separate transform function then has to be written for each output port. If any selection from the input port is required, the select parameter can be used instead of placing a 'Filter by Expression' component before the Reformat.
How it works (sketched in code after the parameter list below):
1. Reads a record from the input port.
2. Passes the record as an argument to the transform function (xfr).
3. Writes the record to the out port if the function returns a success status.
4. Writes the record to the reject port if the function returns a failure status.
Parameters of the Reformat component:
• Count
• Transform (Xfr) Function
• Reject-Threshold
  - Abort
  - Never Abort
  - Use Limit & Ramp
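The following Python sketch (not Ab Initio DML; the transform functions and field names are invented for illustration) mimics the flow above: one transform function per output port, successful records go to that out port, and records whose transform fails go to the reject port.
def reformat(records, transforms):
    outs = [[] for _ in transforms]      # one out port per transform (the count parameter)
    reject = []
    for rec in records:
        for port, xfr in enumerate(transforms):
            try:
                outs[port].append(xfr(rec))          # success status: write to out port
            except Exception:
                reject.append(rec)                   # failure status: write to reject port
    return outs, reject

# Port 0 keeps a subset of fields; port 1 derives a new field.
xfr0 = lambda r: {"id": r["id"]}
xfr1 = lambda r: {"id": r["id"], "total": r["qty"] * r["price"]}
outs, reject = reformat([{"id": 1, "qty": 2, "price": 5.0}, {"id": 2}], [xfr0, xfr1])
print(outs)     # port 0 gets both ids; port 1 gets only the complete record
print(reject)   # the record missing qty/price is rejected by xfr1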
3. Join component:
Join reads records from multiple ports, operates on the records with matching keys using a multi-input transform function, and writes the result to its output ports.
Join handles two activities:
1. Transforming data sources with different record formats.
2. Combining data sources with the same record format.
Join Types
Inner Join: uses only records with matching keys on both inputs. This type of join produces an output record only if there is a record in input 0 with a key that matches a record in input 1.
Full Outer Join: uses all records from both inputs. If a record from one input does not have a matching record in the other input, a NULL record is used for the missing record. This type of join produces output for all of the records that are present in either input 0 or input 1.
Explicit Join (Semi Join): uses all records in one specified input; records with matching keys in the other inputs are optional. Again, a NULL record is used for the missing records.
Case 1 (In0 required): a record is required on input 0, but the presence of a record on input 1 with the same key is optional. Two key combinations can be produced here, so it will again be necessary to prioritize rules in the transform.
Case 2 (In1 required): a record is required on input 1, but the presence of a record on input 0 with the same key is optional.
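To make the three join types concrete, here is a small Python sketch (not Ab Initio DML; the two keyed inputs are invented), with None standing in for the NULL record that is substituted for a missing match:
in0 = {1: "a1", 2: "a2"}          # key -> record from input 0
in1 = {2: "b2", 3: "b3"}          # key -> record from input 1

inner      = {k: (in0[k], in1[k]) for k in in0.keys() & in1.keys()}
full_outer = {k: (in0.get(k), in1.get(k)) for k in in0.keys() | in1.keys()}
semi_in0   = {k: (in0[k], in1.get(k)) for k in in0}     # explicit join, "In0 required"

print(inner)       # only key 2, present on both inputs
print(full_outer)  # keys 1, 2 and 3, with None where a side is missing
print(semi_in0)    # keys 1 and 2: every input-0 record, matches from input 1 optional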
Join Methods:
There are two methods:
• Merge Join: using sorted inputs
• Hash Join: using in-memory hash tables to group input
A merge join is performed by sorting the two data sets to be joined according to the join keys and then merging them together. The merge is very cheap, but the sort can be prohibitively expensive, especially if the sort spills to disk. The cost of the sort can be lowered if one of the data sets can be accessed in sorted order via an index, although accessing a high proportion of blocks of a table via an index scan can also be very expensive in comparison to a full table scan.
A hash join is performed by hashing one data set into memory based on the join columns, then reading the other one and probing the hash table for matches. The hash join is very low cost when the hash table can be held entirely in memory, with the total cost amounting to very little more than the cost of reading the data sets. The cost rises if the hash table has to be spilled to disk in a one-pass sort, and rises considerably for a multipass sort.
The cost of a hash join can be reduced by partitioning both tables on the join key(s). This allows the optimiser to infer that rows from a partition in one table will only find a match in a particular partition of the other table, so for tables having n partitions the hash join is executed as n independent hash joins. This has the following effects:
- The size of each hash table is reduced, hence reducing the maximum amount of memory required and potentially removing the need for the operation to require temporary disk space.
- For parallel query operations the amount of inter-process messaging is vastly reduced, reducing CPU usage and improving performance, as each hash join can be performed by one pair of PQ processes.
- For non-parallel query operations the memory requirement is reduced by a factor of n, and the first rows are projected from the query earlier.
You should note that hash joins can only be used for equi-joins, but merge joins are more flexible.
In general, if you are joining large amounts of data in an equi-join, then a hash join is going to be a better bet.
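As a conceptual illustration only (not how the Co>Operating System implements these components; record layouts and names are invented), the two methods can be sketched in Python as equi-joins on a single key:
from collections import defaultdict

def merge_join(left, right, key):
    """Both inputs must already be sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Collect the whole group of equal keys on each side, then pair them up.
            i2, j2 = i, j
            while i2 < len(left) and left[i2][key] == lk:
                i2 += 1
            while j2 < len(right) and right[j2][key] == rk:
                j2 += 1
            out += [(l, r) for l in left[i:i2] for r in right[j:j2]]
            i, j = i2, j2
    return out

def hash_join(left, right, key):
    """Hashes `left` (ideally the smaller input) into memory and probes it with `right`."""
    table = defaultdict(list)
    for l in left:
        table[l[key]].append(l)
    return [(l, r) for r in right for l in table.get(r[key], [])]

a = [{"k": 1, "v": "a1"}, {"k": 2, "v": "a2"}]
b = [{"k": 2, "v": "b2"}, {"k": 3, "v": "b3"}]
print(merge_join(a, b, "k"))   # [({'k': 2, 'v': 'a2'}, {'k': 2, 'v': 'b2'})]
print(hash_join(a, b, "k"))    # same pairs, without needing sorted inputs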
4. Filter by Expression component:
Filter by Expression filters records according to a specified DML expression. Basically, it can be compared with the WHERE clause of an SQL SELECT statement. Different functions can be used in the select expression of the Filter by Expression component; even a lookup can be used.
- Reads data records from the in port.
- Applies the expression in the select_expr parameter to each record. If the expression returns:
  - a non-0 value: Filter by Expression writes the record to the out port.
  - 0: Filter by Expression writes the record to the deselect port. If you do not connect a flow to the deselect port, Filter by Expression discards the records.
  - NULL: Filter by Expression writes the record to the reject port and a descriptive error message to the error port.
Filter by Expression stops execution of the graph when the number of reject events exceeds the result of the following formula:
limit + (ramp * number_of_records_processed_so_far)
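A Python sketch of this behaviour (not Ab Initio DML; the select expression, limit, and ramp values are invented):
def filter_by_expression(records, select_expr, limit=0, ramp=0.0):
    out, deselect, reject = [], [], []
    processed = 0
    for rec in records:
        processed += 1
        try:
            result = select_expr(rec)
        except Exception:
            result = None                     # stands in for a NULL result
        if result is None:
            reject.append(rec)                # NULL: reject port
            if len(reject) > limit + ramp * processed:
                raise RuntimeError("reject threshold exceeded; graph stops")
        elif result:
            out.append(rec)                   # non-0: out port
        else:
            deselect.append(rec)              # 0: deselect port
    return out, deselect, reject

recs = [{"amount": 10}, {"amount": -3}, {"amount": 0}]
print(filter_by_expression(recs, lambda r: r["amount"] > 0, limit=2))
# ([{'amount': 10}], [{'amount': -3}, {'amount': 0}], [])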
5. Normalize component:
Normalize generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.
1. Reads the input record.
  - If you have not defined input_select, Normalize processes all records.
  - If you have defined input_select, it filters the input records.
2. Performs iterations of the normalize transform function for each input record.
3. Performs temporary initialization.
4. Sends the output records to the out port.
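As an illustration, the Python sketch below (not Ab Initio DML; the order and item fields are invented) turns each input record holding a vector of items into one output record per element:
def normalize(records, length_of, xfr):
    out = []
    for rec in records:
        for i in range(length_of(rec)):     # number of output records for this input record
            out.append(xfr(rec, i))
    return out

orders = [{"order_id": 7, "items": ["pen", "book"]}]
rows = normalize(
    orders,
    length_of=lambda r: len(r["items"]),
    xfr=lambda r, i: {"order_id": r["order_id"], "item": r["items"][i]},
)
print(rows)   # [{'order_id': 7, 'item': 'pen'}, {'order_id': 7, 'item': 'book'}]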
6. Denormalize component:
Denormalize Sorted consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group. Denormalize Sorted requires grouped input.
For example, if you have a record for each person that includes the household to which that person belongs, Denormalize Sorted can consolidate those records into a record for each household that contains a variable number of people.
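Conceptually, Denormalize Sorted does the opposite of Normalize. A Python sketch (not Ab Initio DML; the household and name fields are invented, and the input is assumed to be already grouped by key):
from itertools import groupby

def denormalize_sorted(records, key_field, vector_field):
    out = []
    for key, group in groupby(records, key=lambda r: r[key_field]):  # needs grouped input
        members = [r[vector_field] for r in group]
        out.append({key_field: key, "members": members, "count": len(members)})
    return out

people = [
    {"household": "H1", "name": "Asha"},
    {"household": "H1", "name": "Ravi"},
    {"household": "H2", "name": "Mei"},
]
print(denormalize_sorted(people, "household", "name"))
# [{'household': 'H1', 'members': ['Asha', 'Ravi'], 'count': 2},
#  {'household': 'H2', 'members': ['Mei'], 'count': 1}]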
7. Multistage components:
A multistage component is a transform component in which records are transformed in five stages: input selection, temporary record initialisation, processing, finalization, and output selection.
Examples of multistage components are Aggregate, Rollup, and Scan.
- Rollup: Rollup evaluates a group of input records that have the same key, and then generates records that either summarize each group or select certain information from each group.
- Aggregate: Aggregate generates records that summarize groups of records. In general, use Rollup for new development rather than Aggregate; Rollup gives you more control over record selection, grouping, and aggregation. However, use Aggregate when you want to return the single record that has a field containing either the maximum or the minimum value of all the records in the group.
- Scan: For every input record, Scan generates an output record that includes a running cumulative summary for the group the input record belongs to. For example, the output records might include successive year-to-date totals for groups of records. Scan can be used in continuous graphs.
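The difference between Rollup and Scan can be sketched in Python (not Ab Initio DML; the sales records are invented and assumed grouped by key): Rollup emits one summary record per group, while Scan emits one record per input record carrying a running total for its group.
from itertools import groupby

sales = [
    {"cust": "A", "amount": 10},
    {"cust": "A", "amount": 5},
    {"cust": "B", "amount": 7},
]

def rollup(records, key_field, value_field):
    return [
        {key_field: k, "total": sum(r[value_field] for r in grp)}
        for k, grp in groupby(records, key=lambda r: r[key_field])
    ]

def scan(records, key_field, value_field):
    out = []
    for k, grp in groupby(records, key=lambda r: r[key_field]):
        running = 0
        for r in grp:
            running += r[value_field]                  # cumulative summary so far
            out.append({**r, "running_total": running})
    return out

print(rollup(sales, "cust", "amount"))   # one summary record per customer
print(scan(sales, "cust", "amount"))     # one record per input, with running totals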
8. Partition components:
Partition components are used to divide data sets into multiple sets for further processing.
Several partition components are available, as follows:
- Partition by Round-robin
- Partition by Key
- Partition by Expression
- Partition by Range
- Partition by Percentage
- Partition by Load Balance
Partition by Round-robin
- It reads records from its input port and writes them to the flow partitions connected to its output port. Records are written to partitions in “round robin” fashion, with ‘block-size’ records going to a partition before moving on to the next.
- Not key-based.
- Balancing is good.
- Record-independent parallelism.
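A Python sketch of round-robin distribution with a block size (conceptual only, not how the component is implemented):
def partition_round_robin(records, n_partitions, block_size=1):
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        # block_size records go to one partition before moving on to the next
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions

print(partition_round_robin(list(range(10)), 3, block_size=2))
# [[0, 1, 6, 7], [2, 3, 8, 9], [4, 5]]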
Partition by Key
- Partition by Key reads records from the in port and distributes data records to its output flow partitions according to key values.
- Reads records in arbitrary order from the in port.
- Distributes records to the flows connected to the out port, according to the Key parameter, writing records with the same key value to the same output flow.
- Partition by Key is typically followed by Sort.
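A Python sketch of key-based partitioning (conceptual only; the hash function and field names are illustrative). The point is that records with the same key value always land in the same output partition:
import zlib

def partition_by_key(records, key_field, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        h = zlib.crc32(str(rec[key_field]).encode())   # stable stand-in for the real hash
        partitions[h % n_partitions].append(rec)
    return partitions

recs = [{"cust": "A"}, {"cust": "B"}, {"cust": "A"}, {"cust": "C"}]
for i, p in enumerate(partition_by_key(recs, "cust", 2)):
    print(i, p)   # both "A" records appear in the same partition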
Partition by Expression
- Partition by Expression distributes records to its output flow partitions according to a specified DML expression.
Partition by Range
- Partition by Range distributes records to its output flow partitions according to the ranges of key values specified for each partition. Partition by Range distributes the records relatively equally among the partitions.
- Use Partition by Range when you want to divide data into useful, approximately equal, groups. Input can be sorted or unsorted. If the input is sorted, the output is sorted; if the input is unsorted, the output is unsorted.
- The records with the key values that come first in the key order go to partition 0, the records with the key values that come next in the order go to partition 1, and so on. The records with the key values that come last in the key order go to the partition with the highest number.
- Key-based.
- Balancing depends on the splitters.
- Key-dependent parallelism and global ordering.
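A Python sketch of range partitioning driven by splitter values (conceptual only; the splitters and records are invented):
import bisect

def partition_by_range(records, key_field, splitters):
    # splitters of length n-1 define n partitions: keys <= splitters[0] go to
    # partition 0, keys in (splitters[0], splitters[1]] go to partition 1, and
    # keys above the last splitter go to the final partition.
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        partitions[bisect.bisect_left(splitters, rec[key_field])].append(rec)
    return partitions

recs = [{"id": 3}, {"id": 12}, {"id": 25}, {"id": 7}]
print(partition_by_range(recs, "id", splitters=[10, 20]))
# [[{'id': 3}, {'id': 7}], [{'id': 12}], [{'id': 25}]]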
Partition by Percentage
- Partition by Percentage distributes a specified percentage of the total number of input records to each output flow.
Partition by Load Balance
- Partition with Load Balance distributes data records to its output flow partitions, writing more records to the flow partitions that consume records faster. This component is not frequently used.
- Not key-based.
- Balancing depends on load.
- Record-independent parallelism.
9. De-partition components:
De-partition components read data from multiple flows or operations and are used to recombine data records from different flows. Departitioning combines many flows of data to produce one flow; it is the opposite of partitioning. Each departition component combines flows in a different manner.
Several de-partition components are available, as follows:
- Gather
- Concatenate
- Merge
- Interleave
Gather
- Reads data records from the flows connected to the input port
- Combines the records arbitrarily and writes to the output
- Not key-based.
- Result ordering is unpredictable.
- Has no effect on the upstream processing.
- Most useful method for efficient collection of data from multiple partitions and for repartitioning.
- Used most frequently.
Concatenate
- Concatenate appends multiple flow partitions of data records one after another
- Not key-based.
- Result ordering is by partition.
- Serializes pipelined computation.
- Useful for:
- Appending headers and trailers.
- Creating serial flow from partitioned data.
- Used very infrequently.
Merge
- Key-based.
- Result ordering is sorted if each input is sorted.
- Possibly synchronizes pipelined computation.
- May even serialize.
- Useful for creating ordered data flows.
- Apart from Gather, Merge is the other 'departitioner' of choice.
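A Python sketch of the idea behind Merge (conceptual only; the partitions are invented and assumed already sorted on the key): merging sorted partitions preserves the overall order, which is what makes Merge useful for creating ordered data flows.
import heapq

partition_0 = [{"id": 1}, {"id": 4}, {"id": 9}]
partition_1 = [{"id": 2}, {"id": 3}, {"id": 8}]

merged = list(heapq.merge(partition_0, partition_1, key=lambda r: r["id"]))
print([r["id"] for r in merged])   # [1, 2, 3, 4, 8, 9]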