1. Which processor do you use to exclude duplicate records?
First, identify the duplicates with the “Duplicate Check” processor, supplying the attributes on which duplicates should be detected.
Then take only the records from this processor's “Non-Duplicated” output port, which removes the duplicates from the data stream.
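As a rough illustration of the idea (not EDQ's implementation), the split into duplicated and non-duplicated records can be sketched in Python. The function name `split_duplicates`, the sample records and the `email` key are all hypothetical, and the sketch assumes that every record sharing a key value is treated as duplicated.

```python
from collections import Counter

def split_duplicates(records, key_attrs):
    """Split records into (non_duplicated, duplicated) by the key attributes."""
    def key_of(record):
        return tuple(record[attr] for attr in key_attrs)

    # Count how many records share each key value.
    counts = Counter(key_of(r) for r in records)
    non_duplicated = [r for r in records if counts[key_of(r)] == 1]
    duplicated = [r for r in records if counts[key_of(r)] > 1]
    return non_duplicated, duplicated

# Hypothetical sample data: records 1 and 3 share the same email.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
non_duplicated, duplicated = split_duplicates(records, ["email"])
```

Keeping only `non_duplicated` mirrors reading from the “Non-Duplicated” port: under this assumption, all copies of a repeated key are dropped, not just the extra ones.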
2. Which processor is used to eliminate duplicates?
To eliminate duplicates, use the “Group and Merge” processor, which has three sub-processors: Input, Group and Merge.
Add the attributes to be carried through this data stream to the Input sub-processor.
Add the attribute(s) on which duplicates should be eliminated to the Group sub-processor.
In the Merge sub-processor, select the relevant merge function; the default is “Most Common Value”.
Use the Merged output for the de-duplicated records.
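The group-then-merge idea can be sketched in Python as follows. This is only an analogy under stated assumptions: `group_and_merge` and the sample data are hypothetical, every attribute is merged with a "most common value" rule, and ties are broken by first occurrence, which may differ from how EDQ resolves them.

```python
from collections import Counter, defaultdict

def group_and_merge(records, group_attrs):
    """Group records on group_attrs and merge each group into one record,
    taking the most common value of every attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[a] for a in group_attrs)].append(record)

    merged = []
    for rows in groups.values():
        out = {}
        for attr in rows[0]:
            # most_common(1) returns [(value, count)] for the top value.
            out[attr] = Counter(row[attr] for row in rows).most_common(1)[0][0]
        merged.append(out)
    return merged

# Hypothetical sample data: three records for customer C1, one misspelt.
records = [
    {"cust": "C1", "city": "Paris"},
    {"cust": "C1", "city": "Paris"},
    {"cust": "C1", "city": "Pariss"},
    {"cust": "C2", "city": "Rome"},
]
merged = group_and_merge(records, ["cust"])
```

Each group collapses to a single record, so the misspelt "Pariss" loses to the more frequent "Paris".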
3. What is the difference between the “Lookup and Return” and “Lookup Check” processors?
Lookup and Return looks up the reference data or lookup and brings back the return attribute(s), which can be added to the data stream as new attribute(s) or used to update existing columns.
Lookup Check looks up the reference data or lookup only to check whether the attribute values exist there; it does not bring back any return attributes, even if the reference data defines them.
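The contrast between the two processors can be illustrated with a small Python sketch. The dictionary `reference`, the function names and the attribute names are hypothetical stand-ins for EDQ reference data and processor configuration, not EDQ APIs.

```python
# Hypothetical reference data: lookup column -> return column.
reference = {"United States of America": "US", "France": "FR"}

def lookup_and_return(record, key_attr, return_attr, reference):
    """Enrich the record with the value found in the reference data
    (None when the key is missing)."""
    enriched = dict(record)
    enriched[return_attr] = reference.get(record[key_attr])
    return enriched

def lookup_check(record, key_attr, reference):
    """Only report whether the key exists; nothing is brought back."""
    return record[key_attr] in reference

record = {"name": "Ann", "country": "France"}
enriched = lookup_and_return(record, "country", "country_code", reference)
exists = lookup_check(record, "country", reference)
missing = lookup_check({"country": "Atlantis"}, "country", reference)
```

The first function changes the shape of the record by adding an attribute; the second leaves the record untouched and yields only a pass/fail result.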
4. How do you convert a Date attribute to a different format? For example, MM/DD/YYYY HH:MM:SS to DD/MM/YY.
If the attribute holding the date is of STRING data type, first convert it to a date with the “Convert String to Date” processor, then use the “Convert Date to String” processor, providing the desired output format in that processor's “Options”.
If the attribute is of DATE data type, convert it to a string with the “Convert Date to String” processor, providing the desired output format in its “Options”; if required, you can convert it back to DATE afterwards.
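The parse-then-format route for a string-typed date can be sketched in Python with the standard library. The function name `reformat_date` is hypothetical, and the format codes are Python's `strptime`/`strftime` codes, not EDQ's format masks.

```python
from datetime import datetime

def reformat_date(value, in_fmt="%m/%d/%Y %H:%M:%S", out_fmt="%d/%m/%y"):
    """Parse a date string (string -> date), then render it in the
    desired output format (date -> string)."""
    parsed = datetime.strptime(value, in_fmt)
    return parsed.strftime(out_fmt)

converted = reformat_date("12/31/2023 23:59:58")  # "31/12/23"
```

When the value is already a date, only the second step (formatting) is needed.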
5. How do you add a unique row identifier to each record in EDQ?
To generate a unique row identifier, use the “Add Message ID” processor. It adds a Number attribute that assigns a sequential number to each record.
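Conceptually, this is the same as numbering records in order. A minimal Python sketch, with the hypothetical helper `add_row_id` and attribute name `row_id`, assuming numbering starts at 1:

```python
def add_row_id(records, attr="row_id", start=1):
    """Return copies of the records with a sequential number attribute."""
    return [{**record, attr: number}
            for number, record in enumerate(records, start=start)]

numbered = add_row_id([{"name": "Ann"}, {"name": "Bob"}])
```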
6. What is the main purpose of Lookup and Return?
Lookup and Return is one of the main processors used in EDQ for data enrichment. It takes one or more attributes as input and returns one or more attributes as output, as defined in the reference data.
7. If you have multiple files/sources to read, how do you bring all the data together in one stream?
First, create a snapshot for each file and add a Reader processor for each one; then use the Merge processor to bring all the files together.
P.S.: All the files must be in the same format to be brought together in the Merge processor; alternatively, you can select just a few columns from each file in the Merge processor.
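The idea of merging several inputs on a chosen set of columns can be sketched in Python. Everything here is a hypothetical illustration: `merge_streams`, the source names and the sample rows are assumptions, and the added `source` attribute is simply a convenience for tracing where each row came from.

```python
def merge_streams(streams, columns):
    """Combine rows from several sources into one list, keeping only the
    selected columns and tagging each row with its source name."""
    merged = []
    for source_name, rows in streams.items():
        for row in rows:
            out = {col: row.get(col) for col in columns}
            out["source"] = source_name
            merged.append(out)
    return merged

# Hypothetical inputs: two files with overlapping but not identical columns.
streams = {
    "file_a": [{"id": 1, "name": "Ann", "phone": "555-0100"}],
    "file_b": [{"id": 2, "name": "Bob", "fax": "555-0199"}],
}
combined = merge_streams(streams, ["id", "name"])
```

Selecting only `id` and `name` lets the two inputs line up even though their remaining columns differ.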
8. How will you identify and eliminate duplicates in EDQ?
To identify duplicates only, use the Duplicate Check processor, passing one or more attributes on which duplicates need to be identified.
To eliminate or merge these duplicates, use the Group and Merge processor, passing one or more attributes on which duplicates need to be merged.
9. What is the difference between Reference data and a Lookup?
Reference data is an object that you create explicitly with data, defining which columns to look up and which to return. It holds both the data and the definition and is relatively static, i.e. the data does not change dynamically.
A Lookup is created on top of staged data, again defining which columns to look up and which to return, but its data is dynamic: every time the staged data is refreshed, the lookup works on the refreshed data.
10. After cleansing the data in EDQ, how will you pass the data to a downstream or external system?
This can be done in several ways; two of the most popular are:
Export the final cleansed staged data as a file (.txt, .xls, etc.).
Write the cleansed data to a staging table in a schema outside EDQ; to do so, you first need a data store pointing to that table.
11. What are the types of external sources from which you can import data into EDQ?
EDQ can import from different types of sources, such as text files (.txt, .dsv, etc.), Excel files (.xls, .csv) and databases such as Oracle, DB2, PostgreSQL, MySQL, Microsoft SQL Server and Sybase.
12. What objects do you create in EDQ to import data from files or a database?
First, create a data store pointing to the file or database, then create and run the staged data to import the data. For a file, you can either give the local path or, if the file is on a server, give the server credentials and the path to the file.
13. What is Staged data?
Staged data is where you store intermediate or final results within your EDQ space; it is like an EDQ table that stores the processed data from your processes.
14. What is the difference between Staged data and Reference data?
Staged data stores the data being processed, or the final data after processing, and is considered working data.
Reference data is what you look up with values from the working data in order to get the corresponding values from the same reference record.
Ex: Suppose your working data has the country “United States of America” and you need the country code. You look up the Reference data by country name and get the country code, because the Reference data already holds every country with its corresponding code.
15. Name some commonly used processors.
Lookup and Return
Group and Merge
16. Critical Data Quality Challenges
Data used for decision making and analytics has to be fully trustworthy. In real life, however, data rarely comes clean. It contains missing values, duplicate entries, misspelt words, non-standardized names and various other forms of questionable data. Making critical decisions with such data results in operational inefficiencies, loss of goodwill among customers, faulty market readings, and audit and compliance lapses.
17. Essential Data Quality Capabilities
Ever since there have been databases and applications, there have been data quality problems. Unfortunately, not all of those problems are created equal, and neither are the solutions that address them. Some of the largest differences are driven by the data type, or domain, of the data in question. The most common data domains in data quality are customer data (or, more generally, party data, including suppliers, employees, etc.) and product data. Oracle Enterprise Data Quality products recognize these differences and provide purpose-built capabilities to address each. Quick to deploy and easy to use, Oracle Enterprise Data Quality products bring the ability to enhance the quality of data to all stakeholders in any data management initiative.