Thinking in Hive or Spark SQL - Interview questions

1) Table 1 has 100 records and Table 2 has 10 records. If I do a LEFT JOIN, I expect to get 100 records. But I'm getting more records. How?
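
A common reason is duplicate join keys on the right-hand side. A minimal sketch, assuming hypothetical tables t1 and t2 joined on an id column:

-- If t2 has, say, 3 rows with id = 5, every t1 row with id = 5 is repeated 3 times
-- in the output, so a LEFT JOIN can return more than 100 rows.
SELECT COUNT(*)
FROM t1
LEFT JOIN t2
  ON t1.id = t2.id;

-- Quick check for duplicate keys on the right-hand side:
SELECT id, COUNT(*) AS cnt
FROM t2
GROUP BY id
HAVING COUNT(*) > 1;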


2) There are 2 data sets

Table 1
1,a
2,b
3,c

Table 2
1
2

I need to select all the records from Table 1 that are also present in Table 2. How can I do that?

Solution 1 - INNER JOIN
Solution 2 - Subquery (select * from Table1 where ID in (select ID from Table2))

Which one is the better solution and why? Both are sketched below.
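
One possible way to compare them (a sketch, assuming table1(id, val) and table2(id)):

-- Solution 1: INNER JOIN. Note it can duplicate rows from table1 if table2 has duplicate IDs.
SELECT t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;

-- Solution 2: IN subquery.
SELECT *
FROM table1
WHERE id IN (SELECT id FROM table2);

-- A third option often preferred in Hive: LEFT SEMI JOIN.
-- It stops after the first match, so it never duplicates rows from table1.
SELECT t1.*
FROM table1 t1
LEFT SEMI JOIN table2 t2
  ON t1.id = t2.id;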

3) If a table has too many partitions, does it impact performance? If yes, how can we solve the problem of too many partitions in Hive?
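
One common remedy is to repartition the data on a coarser key, for example by month instead of by day. A sketch, assuming a hypothetical table events partitioned by event_date:

CREATE TABLE events_by_month (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_month STRING)
STORED AS ORC;

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Dynamic-partition insert: each month becomes one partition instead of ~30 daily ones.
INSERT OVERWRITE TABLE events_by_month PARTITION (event_month)
SELECT id, payload, substr(event_date, 1, 7) AS event_month
FROM events;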

4) How will you solve the "large number of small files" problem in Hive / Hadoop? What is the impact of having too many small files?
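
A sketch of two common approaches (the table name sales and the threshold values are illustrative):

-- Ask Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.smallfiles.avgsize = 128000000;
SET hive.merge.size.per.task = 256000000;

-- For existing ORC tables, small files in a partition can be compacted in place:
ALTER TABLE sales PARTITION (dt = '2024-01-01') CONCATENATE;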

5) How will you write a pipe ('|') delimited file in Hive?
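
Two possible ways (a sketch; table and directory names are hypothetical):

-- Option 1: a table whose underlying files are pipe-delimited text.
CREATE TABLE emp_pipe (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Option 2: export query results as pipe-delimited files to a directory.
INSERT OVERWRITE DIRECTORY '/tmp/emp_pipe_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT id, name FROM emp;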

6) What is the difference between the ROW_NUMBER and RANK window functions?
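
A quick way to see the difference side by side (a sketch, assuming a hypothetical employee table):

-- On tied salaries: ROW_NUMBER always gives distinct numbers (1,2,3,4),
-- while RANK gives ties the same rank and skips the next value (1,2,2,4).
SELECT
  department,
  salary,
  ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn,
  RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
FROM employee;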

7) I have a table as follows -

TABLE1
c1,c2,c3
1  ,2  ,3

How will I transpose the columns into rows, so that the output looks like the table below? (One possible approach is sketched after the expected output.)

TABLE2
C,V
c1,1
c2,2
c3,3
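
One possible solution uses the built-in stack() UDTF (a sketch; note that all value columns must share a type):

-- stack() emits one output row per (label, value) pair.
SELECT stack(3,
         'c1', c1,
         'c2', c2,
         'c3', c3) AS (c, v)
FROM table1;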

8) What is the use of the CLUSTER BY clause in Hive SQL?
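
CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x: rows with the same key go to the same reducer, and each reducer's output is sorted on that key. A sketch on a hypothetical employee table:

SELECT department, salary
FROM employee
CLUSTER BY department;

-- Equivalent long form:
SELECT department, salary
FROM employee
DISTRIBUTE BY department
SORT BY department;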

9) What is the difference between an inner join and a cross join?
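
A sketch using the tables from question 2:

-- INNER JOIN keeps only rows whose keys match in both tables.
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;

-- CROSS JOIN has no join condition and produces the Cartesian product
-- (3 rows x 2 rows = 6 rows for the sample data in question 2).
SELECT *
FROM table1 t1
CROSS JOIN table2 t2;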

10) How do you remove duplicate records from a table?
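
One possible approach (a sketch; the table tgt and its columns are hypothetical):

-- Keep one row per business key using ROW_NUMBER, then overwrite the table with the result.
INSERT OVERWRITE TABLE tgt
SELECT id, name, salary
FROM (
  SELECT id, name, salary,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY salary DESC) AS rn
  FROM tgt
) x
WHERE rn = 1;

-- When rows are exact duplicates across all columns, a simpler variant is:
-- INSERT OVERWRITE TABLE tgt SELECT DISTINCT * FROM tgt;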

11) I have a table with 2 columns, Department and Salary. I have to find the maximum salary for each department. How will I achieve that? (One possible query is sketched after the sample data.)

Table 1
Department, Salary
IT,7000
IT,3500
AD,200
IT,400
MG,5000
MG,45000
IT,30000
AD,20
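
A simple GROUP BY does it (a sketch against the sample data above):

SELECT Department, MAX(Salary) AS max_salary
FROM table1
GROUP BY Department;
-- Expected output: AD 200, IT 30000, MG 45000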

12) I have a table which has the IN and OUT times of employees, like -
Table 1
Name,Time
A,10:00AM
B,08:00AM
C,05:00PM
A,4:00PM
C,09:00AM
B,06:00PM

I want to find the time difference between OUT and IN, so that I know how much time every employee is spending in the office. The result should be like -
Table 2
Name,Time
A, 4:00 PM - 10:00AM
B, 6:00PM - 8:00AM
C, 9:00AM - 5:00PM

How will I achieve it? Can you think of any other solution to achieve the same?
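
One possible approach is to parse the 12-hour timestamps and take MAX - MIN per employee (a Hive-flavoured sketch; 'hh:mma' is a SimpleDateFormat pattern and all times are assumed to fall on the same day):

SELECT
  name,
  (MAX(unix_timestamp(`time`, 'hh:mma')) - MIN(unix_timestamp(`time`, 'hh:mma'))) / 3600.0
    AS hours_in_office
FROM table1
GROUP BY name;

An alternative is a self-join of the table on name, pairing each IN row with its OUT row, or conditional aggregation if the rows carry an explicit IN/OUT flag.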

13) Giving hints in SQL: if I mention a broadcast hint in a SQL query, does it force the job to do a broadcast join, or is it just a hint to the execution engine (MR, Tez or Spark) that a broadcast join can be done?
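
For reference, the hint syntax in both engines (a sketch, reusing the tables from question 2):

-- Hive MAPJOIN hint: the small table t2 is loaded into memory on each mapper.
SELECT /*+ MAPJOIN(t2) */ t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;

-- Spark SQL broadcast hint:
SELECT /*+ BROADCAST(t2) */ t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.id = t2.id;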

14) I have data such that -
Table 1
C1,C2
A,V1
A,V2
A,V3
B,V2
B,V4
B,V1

I have to transpose the data above from rows into columns, so that the data looks like -
Table 2
C1,C2,C3,C4
A,V1,V2,V3
B,V2,V4,V1

How will I do that?
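
One possible solution groups the values per key and then indexes into the collected array (a sketch; note that collect_list does not guarantee order, so an explicit ordering may be needed to match the output above exactly):

SELECT
  c1,
  vals[0] AS c2,
  vals[1] AS c3,
  vals[2] AS c4
FROM (
  SELECT c1, collect_list(c2) AS vals
  FROM table1
  GROUP BY c1
) t;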

15) How do you write non-equi joins in Hive / Spark?
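
A sketch with hypothetical employee and salary_band tables:

-- Spark SQL (and recent Hive versions) accept non-equality conditions in the ON clause:
SELECT e.name, s.band
FROM employee e
JOIN salary_band s
  ON e.salary BETWEEN s.min_salary AND s.max_salary;

-- Older Hive versions only allow equality in ON, so the same logic is often written
-- as a CROSS JOIN filtered in WHERE:
SELECT e.name, s.band
FROM employee e
CROSS JOIN salary_band s
WHERE e.salary BETWEEN s.min_salary AND s.max_salary;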
