1.Explain FPGA and how it is useful for query performance.
FPGA: Field Programmable Gate Array (FPGA) is located on each SPU. Netezza is different from other architectures. Netezza can do a hardware upgrade through software by using FPGA. Hardware is reconfigured during install.
While reading data from disk, the FPGA on each SPU also helps in ‘filtering’ unnecessary data before getting loaded into memory on each SPU. This way, FPGA does not overwhelm with all the data from disk.
2.What is a zone map?
Zone map in Netezza is similar (concept wise) to partitions in Oracle. Netezza maintains a map for data so that it relies on a zone map to pull only the range it is interested in. For example, if we need to pull out data from Jan 2009 till June 2009 from a table that is distributed on a date column, a zone map helps us to achieve this. Zone map is maintained by Netezza automagically, no user intervention needed. Zone mapping is done at a block (extent) level. Netezza has zone maps for all columns (not just distributed columns) and includes information such as minimum, maximum, and total number of records.
3. How do you deal with historical data, with respect to zone maps.
Sort data first, based on historical data (for example, date) and load this using nzload.
4.What are different ways to load?
- External tables
- Create table AS (aka, CTAS).
- Inserts (Eeeewee!!)
5.Does everything gets cached in Netezza (or any other data appliance).
Typically only schema and other database objects are cached in appliances. Data is not cached, in general. In most cases, data is not saved anywhere (in any cache or on the host computer) and is streamed directly from SPU to client software.
Related Courses: IBM Informix
6. What is the best data appliance?
Obviously, it all depends. This is my (limited) view:
- From features respect, Green Plum.
- Popularity with a bit of hype, Netezza.
- Matured and well respected, Teradata.
- With existing database integration, Dataupia.
* Teradata: 72 nodes (two quad-core CPUs, 32GB RAM,104 / 300GB disks per node) and manages 2.4PB.
* Greenplum: Fox Interactive Media using a 40-node, Sun X4500 with two dual-core CPUs, 48 / 500GB disks, and 16 GB RAM (1PB total disk space)
Source: Vertica’s Michael Stonebraker!
7.How is load achieved in Netezza and why is that quick / fast?
Loads bypass a few steps that typically a query would go through (a query goes through plan generation, optimization and transaction management). Loads are done in terms of “sets’ ‘ and this set is based on underlying table structure (thus loads for two different tables are different as their sets are based on table structures). Data is processed to check format and distribution of records calculated very quickly (in one step), fills into the ‘set’ structure and writes to the storage structure. Storage also performs space availability and other admin tasks; all these operations go pretty quickly (think of them as UNIX named pipes that stream data and SPU stores these records).
8.When are we likely to receive incorrect (aggregate) results.
Very rarely a driver may return aggregated results that are still getting processed back to the client. In this case, the client may assume that calculation is complete, instead of updating with the latest or final results. Obviously, the driver has to wait for Netezza to complete the operation on the host computer, before delivering results.
Related Courses: Tibco Spotfire
9.Explain how data gets stored in Netezza and how SPU failover takes place.
Data is stored based on a selected field(s) which are used for distribution.
==Data (A)==> Hash Function (B) ==> Logical SPU identifier list (C) ==> Physical SPU list (D) ==> Storage (E)
When data arrives, it is hashed based on field(s) and a hash function (B) is used for this purpose. For example, for a hypothetical 32 node system system, the logical CPU identifier list has 32 unique entries. If there are 1000 hashed data items from (B), there are 1000 entries in (C), all having only 32 SPU entries (a number of data items go to the same SPU, thus multiple (B) entries map to the same (C)). For instance, (C) has values [3,19,30,7,20,25,11,3,22,19….]. This way, 1000 data entries are mapped. (D) has a physical IP address of both primary and failover SPU. If there is a failover, this is the only place where Netezza needs to update its entries. Same goes for a system that has a new SPU added. It is a little complicated, in principle, this is the concept.
10.What are the 4 environment variables that are required? What are the different states on Netezza?
Environment variables: NZ_HOST, NZ_DATABASE, NZ_USER and NZ_PASSWORD
* Online: Normal or usual state.
* Stopped: Netezza will shutdown after completing current queries, no new queries allowed.
* Offline: Waits for completion of current queries, new or queries in queue receive error.
* Paused: Same as above, but no error displayed. Typically caused by the Netezza Boot or startup.
* Down: Just plain down, could be due to a Netezza server problem or user-initiated.
11.Does Netezza support concurrent updates of the same record?
In case of a conflict in which the same record is set for modification, Netezza rolls back recent transactions that are attempted on the same record, in fact on the same table. This is generally acceptable in DW environments. Netezza does support serialization transactions and does not permit dirty reads.
12. How Netezza updates records. Give an idea of how transactions are maintained and how read consistency is maintained.
Netezza does not update records in place; it marks records with the delete flag. In fact, each record contains two slots, one for creating xid another for delete xid. Delete xid allows us to mark a record with a current transaction for deletion; up to 31 transactions are allowed in Netezza for all tables. As noted earlier, only one update at a time is allowed on the same table though. Here update refers to transactions that are not committed yet. Coming back to delete xid, this is how Netezza maintains transaction rollback and recovery. Once a record is modified, it’s delete xid is given transaction id; this is changed from previous value of 0, all records when loaded will contain 0 for delete xid. Note that FPGA uses its intelligence to scan data before delivering them to host or applications.
[ROW id][Create xid][Delete xid]
[R1][T1] // First time a record is loaded, record R1
// After some time, updating the same record
[R1][T1][T33] // Record R1 is updated; note T33
[R33][T33] // New update record R33; similar to a new record this has zero for Delete Xid
If the record is deleted, simply deletion xid will contain that transaction id.
* Based on the above, how do you know a record is the latest. It has zero in the delete xid flag.
* Extending the same logic, how do we know a record is deleted? It has a zero value in the delete xid flag.
* How do you roll back to transaction. Similar to the above listing, we can roll back a transaction of our interest.
Note that transaction id is located in the create xid flag and that is our point of interest in this case. From what I know, row id and create id are never modified by Netezza.
13.What happens to records that are loaded during the nzload process, but were not committed?
They are logically deleted and administrators can run nz reclaim; we may also truncate tables.
14. Can a group become a member of another group in Netezza user administration? Can we use the same group name for databases?
In Netezza, a public group is created automatically and every one is a member of this group by default. We can create as many groups and any user can be a member of any group(s). You can not be a member of another group. Group names, usernames and database names are unique. That is, we can not have a database called sales and a group also called sales.
15.How can we give a global permission to user joe so that he can create tables in any database.
Login into the system database and give that permission to the user by saying “grant create table to joe;”
16.What permission will you give to connect to a database.
List. Grant list, select on table to public (if logged into sales database, this allows all users to query tables in sales database).
17.Do we need to drop all tables and objects in that database, before dropping a database.
No, the drop database will take care of it.
18.What constraints on a table are enforced?
Not null and default. Netezza does not apply PK and FK.
19.Why NOT NULL specification is better in Netezza.
Specifying non-null results in better performance as NULL values are tracked at row header level. Having NULL values results in storing references to NULL values in the header. If all columns are NOT NULL, then there is no record header.
20.Create Table AS (CTAS), does it distribute data randomly or based on the table on which it received data?
Response: Newly created table from CTAS gets distribution from the original table.
21.Why do you prefer truncate instead of drop table command.
Just empties data from the table, keeping table structure and permission intact.
22.When no distribution clause is used while creating a table, what distribution is used by Netezza?
First column (same as in Teradata).
23.Can we update all columns in a Netezza Table.
No, the column that is used in the distribution clause cannot be used for updates. Remember, up to four columns can be used for distribution of data on SPU. From a practical sense, updating distribution columns results in redistribution of data; the single highest performance hit when a large table is involved. This restriction makes sense.
24.What is data slice and SPU?
For me, they are the same! Of course, this answer is not an accurate reply in your interview(s).
25. What data type works best for zone maps.
Zone maps work best for integer data types.