1. What is DataStage?
DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.
2. What are the primary usages of the DataStage tool?
DataStage is an ETL tool that is primarily used for extracting data from source systems, transforming that data, and finally loading it into target systems.
3. What are the different common services in DataStage?
Below is the list of common services in Datastage:
- Metadata services
- Unified service deployment
- Security services
- Logging and reporting services
4. How do you start developing a DataStage project?
The very first step is to create a DataStage job on the DataStage server. All the DataStage objects that we create are stored in the DataStage project. A DataStage project is a separate environment on the server for jobs, tables, definitions, and routines.
5. What are DataStage sequences?
A DataStage sequence connects DataStage jobs in a logical flow.
6. What are the Processing stages?
The processing stages allow us to apply the actual data transformations. For example, the Aggregator stage under the Processing category allows us to apply all the 'group by' operations. Similarly, we have other Processing stages, such as the Join stage, which allows us to join together the data coming from two different input streams.
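As a rough conceptual illustration (not DataStage code), here is a minimal Python sketch of what an Aggregator-style 'group by' and a Join-style key lookup do; the column names and sample rows are made up:

```python
from collections import defaultdict

# 'Aggregator'-style group-by: total amount per customer.
orders = [
    {"customer": "A", "amount": 10},
    {"customer": "B", "amount": 5},
    {"customer": "A", "amount": 7},
]
totals = defaultdict(int)
for row in orders:
    totals[row["customer"]] += row["amount"]
# totals -> {'A': 17, 'B': 5}

# 'Join'-style inner join of two input streams on the customer key.
customers = [{"customer": "A", "name": "Acme"}, {"customer": "B", "name": "Bolt"}]
by_key = {c["customer"]: c for c in customers}
joined = [{**row, **by_key[row["customer"]]}
          for row in orders if row["customer"] in by_key]
```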
7. What are routines in DataStage? List the various types of routines.
A routine is a set of functions defined in the DataStage Manager and run via the Transformer stage.
There are 3 kinds of routines:
- Parallel routines
- Mainframe routines
- Server routines
8. How do you remove duplicate values in DataStage?
There are two ways to handle duplicate values:
- We can use the Remove Duplicates stage to eliminate duplicates.
- We can use the Sort stage to remove duplicates. The Sort stage has a property called 'Allow Duplicates'; when this property is set to false, we won't get duplicate values in the output of the sort (see the sketch below).
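The idea behind sort-based de-duplication can be sketched in a few lines of Python (a loose analogy, not the Sort stage itself; the rows and key name are made up):

```python
# Sort on the key, then keep only the first row of each key group,
# in the spirit of the Sort stage with 'Allow Duplicates' = false.
rows = [
    {"id": 2, "val": "x"},
    {"id": 1, "val": "y"},
    {"id": 2, "val": "z"},
]
rows.sort(key=lambda r: r["id"])

deduped, last_key = [], object()   # sentinel that matches no real key
for r in rows:
    if r["id"] != last_key:
        deduped.append(r)
        last_key = r["id"]
# deduped -> one row for id 1 and one for id 2
```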
9. What are the different kinds of views available in a Datastage director?
There are 3 kinds of views available in the Datastage director. They are:
- Log view
- Status view
- Job view
10. Explain the DataStage Parallel Extender or Enterprise Edition (EE).
Parallel Extender in DataStage is the data extraction and transformation engine for parallel processing.
There are two types of parallel processing available (see the sketch after this list):
- Pipeline Parallelism
- Partition Parallelism
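Conceptually, the two kinds of parallelism can be pictured with a minimal Python sketch (plain generators and lists stand in for DataStage's concurrent stage processes and partitions; all names here are illustrative):

```python
# Pipeline parallelism: a downstream stage starts consuming rows while
# the upstream stage is still producing them. Generators model the
# streaming; real DataStage runs stages as concurrent OS processes.
def extract():
    for i in range(10):
        yield {"key": i}

def transform(rows):
    for row in rows:
        yield {**row, "doubled": row["key"] * 2}

# Partition parallelism: the same logic applied independently to
# key-partitioned subsets of the data.
N_PARTITIONS = 4
partitions = [[] for _ in range(N_PARTITIONS)]
for row in extract():
    partitions[row["key"] % N_PARTITIONS].append(row)

results = [list(transform(p)) for p in partitions]
```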
11. What is a conductor node in DataStage?
Every parallel job run has a conductor process, on the node where execution starts; a section leader process for each processing node; a player process for each set of combined operators; and an individual player process for each uncombined operator. Whenever we want to kill a job, we have to destroy the player processes first, then the section leader processes, and finally the conductor process.
12. What are Stages in Datastage?
Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of functionality that performs either a simple or advanced data integration task. Stages represent the processing steps that will be performed on the data.
13. What are Links in Datastage?
A link is a representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. Links are like pipes through which the data flows from one stage to the next.
14. What are Jobs in Datastage?
Jobs include the design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Jobs are created within a visual paradigm that enables instant understanding of the goal of the job.
15. What are Sequence jobs in Datastage?
A sequence job is a special type of job that you can use to create a workflow by running other jobs in a specified order. This type of job was previously called a job sequence.
16. What are Table definitions?
Table definitions specify the format of the data that you want to use at each stage of a job. They can be shared by all the jobs in a project and between all projects in InfoSphere DataStage. Typically, table definitions are loaded into source stages. They are sometimes loaded into target stages and other stages.
17. What are Containers in Datastage?
Containers are reusable objects that hold user-defined groupings of stages and links. Containers create a level of reuse that allows you to use the same set of logic several times while reducing the maintenance. Containers make it easy to share a workflow, because you can simplify and modularize your job designs by replacing complex areas of the diagram with a single container.
18. What are Projects in Datastage?
A project is a container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.
19. What is Parallel processing design?
InfoSphere DataStage brings the power of parallel processing to the data extraction and transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data pipelining and data partitioning, allowing you to design an integration process without concern for data volumes or time constraints, and without any requirements for hand-coding.
20. What are Operators in Datastage?
Operators are the underlying components of the parallel engine that stages are compiled into. A single stage might correspond to a single operator or to a number of operators, depending on the properties you have set, and on whether you have chosen to partition, collect, or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
21. What is OSH in Datastage?
OSH (Orchestrate Shell) is the scripting language used internally by the parallel engine; parallel jobs are compiled into OSH scripts for execution.
22. What is the splitsubrec restructure operator and what does it do?
The splitsubrec restructure operator separates input subrecords into sets of output top-level vector fields.
23. What is the splitvect restructure operator and what does it do?
The splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly named top-level fields.
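A loose Python illustration of the splitvect idea (hypothetical field names; the real operator works on parallel-engine records, not dictionaries):

```python
# Promote the elements of a fixed-length vector field to a set of
# similarly named top-level fields.
row = {"id": 7, "scores": [10, 20, 30]}        # 3-element vector field

flat = {"id": row["id"],
        **{f"scores{i}": v for i, v in enumerate(row["scores"])}}
# flat -> {'id': 7, 'scores0': 10, 'scores1': 20, 'scores2': 30}
```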
24. What is the tagbatch restructure operator and what does it do?
The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.
25. What is the tagswitch restructure operator and what does it do?
The tagswitch restructure operator converts the contents of tagged aggregates into InfoSphere DataStage-compatible records.
26. What is a dynamic array in DataStage?
Dynamic arrays map the structure of DataStage file records to character string data. Any character string can be a dynamic array. A dynamic array is a character string containing elements that are substrings separated by delimiters.
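Since a dynamic array is just a delimited string, the concept is easy to sketch in Python (the record contents are made up; CHAR(254) and CHAR(253) are the conventional field and value marks in the underlying BASIC dialect):

```python
# A dynamic array: a string whose elements are substrings separated by
# system delimiters (field mark and value mark).
FM, VM = chr(254), chr(253)

record = FM.join(["Smith", VM.join(["renting", "house"]), "35"])
fields = record.split(FM)        # top-level fields of the record
values = fields[1].split(VM)     # multivalued field -> ['renting', 'house']
```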
27. Explain the BCP stage.
The BCP (Bulk Copy Program) stage is used to load a large amount of data into one target table of Microsoft SQL Server.
28. What is the need for a Link Partitioner and Link Collector in DataStage?
In DataStage, the Link Partitioner is used to split data into various parts using certain partitioning methods. The Link Collector is used to collect data from many partitions into a single stream and save it in the target table.
29. State the difference between DataStage and DataStage TX.
DataStage is an ETL (Extract, Transform and Load) tool, whereas DataStage TX is an EAI (Enterprise Application Integration) tool.
30. What is QualityStage in DataStage?
QualityStage is used for cleansing data with the DataStage tool. It is a client-server software tool that is provided as part of IBM Information Server.
31. What is job control in DataStage?
Job control is used for controlling a job or executing multiple jobs in a parallel manner. It is deployed using the Job Control Language within the IBM DataStage tool.
32. How to do DataStage jobs performance tuning?
First, we have to select the right configuration files. Then, we need to select the right partitioning and buffer memory. We have to deal with the sorting of data and the handling of null values. We should try to use Modify, Copy, or Filter stages instead of the Transformer, and reduce the propagation of unnecessary metadata between stages.
33. What is a repository table in DataStage?
The term ‘repository’ is another name for a data warehouse. It can be centralized or distributed. The repository table is used for answering ad-hoc, historical, analytical, or complex queries.
34. How can we kill a DataStage job?
To kill a DataStage job, we need to first kill the individual process ID, which ensures that the DataStage job is stopped.
35. How do we compare the Validated OK with the Compiled Process in DataStage?
The compile process ensures that the important stage parameters are mapped and correct, so that an executable job is created. Validation (Validated OK) additionally makes sure that the connections are valid.
36. Explain the feature of data type conversion in DataStage.
If we want to do data type conversion in DataStage, then we can use the data conversion function. For it to execute successfully, the input and output to and from the operator must be the same, and the record schema must be compatible with the operator.
37. What is the significance of the exception activity in DataStage?
Whenever an unfamiliar error occurs while executing the job sequencer, all the stages after the Exception activity are run. This is what makes the Exception activity so important in DataStage.
38. How do we clean a DataStage repository?
To clean a DataStage repository, we go to DataStage Manager > Job (in the menu bar) > Clean Up Resources. If we want to further remove the logs, we go to the respective jobs and clean up the log files.
39. How do we call a routine in DataStage?
Routines are stored in the Routines branch of the DataStage repository, which is where we can create, view, or edit them. Routines in DataStage can be of the following types: job control routines, before/after subroutines, and transform functions.
40. What is the difference between an Operational Data Store (ODS) and a Data Warehouse?
An ODS can be considered a staging area for real-time analysis and user processing; thus, it is a temporary repository. The data warehouse, by contrast, is used for long-term data storage and holds the complete data of the entire business.
41. What does NLS mean in DataStage?
NLS means National Language Support. It means that the IBM DataStage tool can be used in various languages, including multi-byte character languages such as Chinese and Japanese. We can read and write in any supported language and process the data as per the requirement.
42. What is the difference between DataStage 7.5 and 7.0?
In DataStage 7.5, many new stages were added for more robustness and smoother performance, such as the Procedure stage, Command stage, and Generate Report.
43. In DataStage, how can you fix a truncated data error?
The truncated data error can be fixed by setting the environment variable APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS.
44. How do you manage date conversion in DataStage?
We can use the date conversion function for this purpose, i.e. Oconv(Iconv(Fieldname, "Existing Date Format"), "Another Date Format").
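As a rough Python analogy of that Oconv/Iconv pattern (this is not DataStage BASIC; strptime/strftime stand in for Iconv/Oconv, and the formats shown are arbitrary):

```python
from datetime import datetime

# Iconv parses the string into an internal date value; Oconv re-emits
# it in the new external format. strptime/strftime play those roles here.
def convert_date(field: str, existing_fmt: str, new_fmt: str) -> str:
    return datetime.strptime(field, existing_fmt).strftime(new_fmt)

print(convert_date("2024-01-31", "%Y-%m-%d", "%d/%m/%Y"))  # 31/01/2024
```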
45. Define APT_CONFIG in DataStage.
It is the environment variable (fully, APT_CONFIG_FILE) that is used to identify the *.apt configuration file in DataStage. This file stores the node information, disk storage information, and scratch information.
46. What do transaction size and array size mean in DataStage?
Transaction size means the number of rows written before committing the records to a table. Array size means the number of rows written to or read from the table in a single operation.
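The distinction is easy to see in a generic database-loading sketch (plain Python with sqlite3 as a stand-in target; the table name and sizes are made up):

```python
import sqlite3

ARRAY_SIZE = 100         # rows sent to the database per write call
TRANSACTION_SIZE = 1000  # rows written before each commit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
rows = [(i,) for i in range(5000)]

written = 0
for start in range(0, len(rows), ARRAY_SIZE):
    batch = rows[start:start + ARRAY_SIZE]              # one 'array' of rows
    conn.executemany("INSERT INTO t VALUES (?)", batch)
    written += len(batch)
    if written % TRANSACTION_SIZE == 0:
        conn.commit()                                   # one transaction ends
conn.commit()   # commit any remaining rows
```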
47. What is a DataStage macro?
In DataStage, macros can be used in expressions, job control routines, and before/after subroutines. The available macros are concerned with ascertaining job status.