B561 Assignment 7

Testing Effectiveness of Query Optimization;

Object-Relational Database Programming

Key-Value Databases and Graph Databases

(Draft)

This assignment focuses on problems related to Lecture 9 and Lectures 18

through 22.

• Lecture 18: Algorithms for RA operations

• Lecture 19: Query processing and query plans

• Lecture 20: Object-relational database programming

• Lecture 20: Key-value stores. NoSQL in MapReduce style

• Lecture 21: Key-value stores; NoSQL in Spark style

• Lecture 22: Graph databases

Other lectures that a relevant for this assignment are Lectures 8, 13, and 14:

• Lecture 8: Translating Pure SQL queries into RA expressions

• Lecture 9: Query optimization

• Lecture 13: Object-Relational databases and queries

• Lecture 14: Nested Relational, Semi-structured Databases, Document

Databases

This assignment has problems that are required to be solved. Others, identified as such, are practice problems that you should attempt since the serve as

preparation for the final exam.

Turn in a single assignment7.sql file that contains the PostgreSQL code of

the solutions for the problem that require such code. (Do not include solutions

for the practice problems.) Also turn in a assignment7.txt file that contains

all the output associated with the problems in this assignment. For all the other

problems, submit a single assignment7.pdf file with your solutions.

1

1 Analysis of Queries Using Query Plans

Consider Lecture 19 on Query Processing: Query Planning in (Object) Relational Systems. Consider the analysis, using query plans, for the SOME quantifier.

1. Assume the relation schemas P(x), Q(x), R(x, y) and S(x, z).

Consider the NOT ALL generalized query

{(p.x, q.x)|P(p) ∧ Q(p) ∧ R(p.x) 6⊃ S(p.x)}

where

R(p.x) = {r.y | R(r) ∧ r.x = p.x}

S(q.x) = {s.z | S(s) ∧ s.x = q.x}

Consider Lecture 19 on Query Processing: Query Planning in (Object)

Relational Systems and in particular the analysis, using query plans, for

the SOME generalized quantifier.

Now to the problem. In analogy with the analysis for the SOME generalized

quantifier, do an analysis for the NOT ALL generalized quantifier.

2 Experiments to Test the Effectiveness of Query

Optimization

In the following problems, you will conduct experiments in PostgreSQL to gain

insight into whether or not query optimization can be effective. In other words,

can it be determined experimentally if optimizing an SQL or an RA expression

improves the time (and space) complexity of query evaluation? Additionally,

can it be determined if the PostgreSQL query optimizer attains the same (i.e.,

better or worse) optimization as optimization by hand. Recall that in SQL you

can specify each RA expression as an RA SQL query. This implies that each of

the optimization rules for RA can be applied directly to queries formulated in

RA SQL.

In the following problems you will need to generate artificial data of increasing size and measure the time of evaluating non-optimized and optimized

queries. The size of this data can be in the ten or hundreds of thousands of

tuples. This is necessary because on very small data is it is not possible to gain

sufficient insights into the quality (or lack of quality) of optimization. You can

use the data generation functions that were developed in Assignment 6. Additionally, you are advised to examine the query plans generated by PostgreSQL.

For the problems in this assignments, we will use three relations:1

P(a int)

R(a int, b int)

S(b int)

1A typical case could be where P is Person, R is Knows, and S is the set of persons with the

Databases skill. Another case could where P is the set of persons who work for Amazon, R is

personSkill and S is the set of skills of persons who live in Bloomington. Etc.

2

To generate P or S, you should use the function SetOfIntegers which generate a set of up to n randomly selected integers in the range [l, u]:

create or replace function SetOfIntegers(n int, l int, u int)

returns table (x int) as

$$

select floor(random() * (u-l+1) + l)::int as x

from generate_series(1,n)

group by (x) order by 1;

$$ language sql;

To generate R, you should use the function BinaryRelationOverIntegers

which generates up to n randomly selected pairs with first components in the

range [l1, u1] and second components in the range [l2, u2]:

create or replace function BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int)

returns table (x int, y int) as

$$

select floor(random() * (u_1-l_1+1) + l_1)::int as x,

floor(random() * (u_2-l_2+1) + l_2)::int as y

from generate_series(1,n)

group by (x,y) order by 1,2;

$$ language sql;

Example 1 Consider the query Q1

select distinct r1.a

from R r1, R r2

where r1.b = r2.a;

This query can be translated and optimized to the query Q2

select distinct r1.a

from R r1 natural join (select distinct r2.a as b from R r2) r2;

Image that you have generated a relation R. Then when you execute

explain analyze

select distinct r1.a

from R r1, R r2

where r1.b = r2.a;

the system will return its query plan as well as the execution time to evaluate

Q1 measured in ms. And, when you execute

explain analyze

select distinct r1.a

from R r1 natural join (select distinct r2.a as b from R r2) r2;

the system will return its query plan as well as the execution time to evaluate

Q2 measured in ms. This permits us to compare the non-optimized query Q1

3

with the optimized query Q2 for various differently-sized relations R. Here are

some of these comparisons for various differently-sized random relations R. In

this table, R was generated with lower and upper bounds l1 = l2 = 1000 and

u1 = u2 = 1000.

2

R Q1 (in ms) Q2 (in ms)

104 27.03 7.80

105 3176.53 58.36

106 69251.58 400.54

Notice the significant difference between the execution times of the non-optimized

query Q1 and the optimized query Q2. So clearly, optimization works on query

Q1.

Incidentally, below are the query plans for Q1 and Q2. Examining these

query plans should reveal why Q1 runs much slower than Q2. (Why?)

QUERY PLAN for Q1

————————————

HashAggregate

Group Key: r1.a

-> Hash Join

Hash Cond: (r1.b = r2.a)

-> Seq Scan on r r1

-> Hash

-> Seq Scan on r r2

QUERY PLAN for query Q2

——————————————

HashAggregate

Group Key: r1.a

-> Hash Join

Hash Cond: (r1.b = r2.a)

-> Seq Scan on r r1

-> Hash

-> HashAggregate

Group Key: r2.a

-> Seq Scan on r r2

2All the experiments where done on a MacMini.

4

We now turn to the problems for this section.

2. Consider query Q3

select distinct p.a

from P p, R r1, R r2, R r3, S s

where p.a = r1.a and r1.b = r2.a and r2.b = r3.a and r.b = S.b;

Intuitively, if we view R as a graph, and P and S as node types (properties),

then Q3 determines each P-node in the graph from which there emanates

a path of length 3 that ends at a S-node.3

I.e., a P-node n0 is in the

answer if there exists sequence of nodes (n0, n1, n2, n3) such that (n0, n1),

(n1, n2), and (n2, n3) are edges in R and n3 is a S-node.

(a) Translate and optimize this query and call it Q4. Then write Q4 as

an RA SQL query just as was done for query Q2 in Example 1.

(b) Compare queries Q3 and Q4 in a similar way as we did for Q1 and

Q2 in Example 1.

You should experiment with different sizes for R. Incidentally, these

relations do not need to use the same parameters as those shown in

the above table for Q1 and Q2 in Example 1.

(c) What conclusions do you draw from the results of these experiments

regarding the effectiveness of query optimization in PostgreSQL and/or

by hand?

3Such a query is typical in Graph Databases.

5

3. Consider the Pure SQL Q5 which is an formulation of a variation of the

not subset (not only) set semijoin query

{p.a | P(p) ∧ R(p.a) 6⊆ S}

where

R(p.a) = {r.b | R(r) ∧ r.a = p.a}.

select p.a

from P p

where exists (select 1

from R r

where r.a = p.a and

not exists (select 1 from S s where r.b = s.b));

(a) Translate and optimize this query and call it Q6. Then write Q6 as

an RA SQL query just as was done for Q2 in Example 1.

(b) An alternative way to write a query equivalent with Q5 is as the

object-relational query

with nestedR as (select P.a, array_agg(R.b) as bs

from P natural join R

group by (P.a)),

Ss as (select array(select b from S) as bs)

select a

from nestedR

where not (bs <@ (select bs from Ss));

Call this query Q7.

Compare queries Q5, Q6, and Q7 in a similar way as we did in Example 1. However, now you should experiment with different sizes

for P, R and S as well as consider how P and S interact with R.

(c) What conclusions do you draw from the results of these experiments?

6

4. Consider the Pure SQL Q8 which is an formulation of a variation of the

not superset, (not all) set semijoin query

{p.a | |P(p) ∧ R(p.a) 6⊇ S}

where

R(p.a) = {r.b | R(r) ∧ r.a = p.a}.

select p.a

from P p

where exists (select 1

from S s

where not exists (select 1 from R where p.a = r.a and r.b = s.b));

(a) Translate and optimize this query and call it Q9. Then write Q9 as

an RA SQL query just as was done for Q2 in Example 1.

(b) An alternative way to write a query equivalent with Q8 is as the

object-relational query

with nestedR as (select P.a, array_agg(R.b) as bs

from P natural join R

group by (P.a)),

Ss as (select array(select b from S) as bs)

select a

from P

where a not in (select a from nestedR) and

not((select bs from Ss) <@ ’{}’)

union

select a

from nestedR

where not((select bs from Ss) <@ bs);

Call this query Q10.

Compare queries Q8, Q9, and Q10 in a similar way as we did In

Example 1. However, now you should experiment with different sizes

for P, R and S as well as consider how P and S interact with R. For

example, it could be that

(c) What conclusions do you draw from the results of these experiments?

5. Give a brief comparison of your results for Problem 3 and Problem 4. In

particular, where the results show significant differences, explain why you

think that is the case. And, where the results show similarities, explain

why you think that is the case.

7

3 Object Relational Programming

The following problems require you to write object relational programs. Many

of these require program written in Postgres’ plpgsql database programming

language.

6. Practice Problem–not graded Consider the relation schema V(node

int) and E(source int, target int) representing the schema for storing a directed graph G with nodes in V and edges in E.

Now let G be a directed graph that is acyclic, i.e., a graph without cycles.

4

A topological sort of an acyclic graph G is a list of all nodes (n1, n1, . . . , nk)

in V such that for each edge (m, n) in E, node m occurs before node n in

this list. Note that a path can be stored in an array.

Write a PostgreSQL program topologicalSort() that returns a topological sort of G.

7. Consider a parent-child relation PC(parent, child). (You can assume

that PC is a rooted tree and the domain of the attributes parent and

child is int.) An edge (p, c) in PC indicates that node p is a parent of

node c. Now consider a pair of nodes (m, n) in PC (m and n maybe the

same nodes.) We say that m and n are in the same generation when the

distance from m to the root of PC is the same as the distance from n to

the root of PC.

Consider the following recursive query that computes the sameGeneration

relation:

WITH RECURSIVE sameGeneration(m, n) AS

((SELECT parent, parent FROM PC) UNION (select child, child from PC)

UNION

SELECT t1.child, t2.child

FROM sameGeneration pair, PC t1, pc t2

WHERE pair.m = t1.parent and pair.n = t2.parent)

select distinct pair.m, pair.n from sameGeneration pair order by m, n;

Write a non-recursive function sameGeneration() in the language plpgsql

that computes the sameGeneration relation.

4A cycle is a path (n0, . . . , nl) where n0 = nl

.

8

8. Consider the following relational schemas. (You can assume that the domain of each of the attributes in these relations is int.)

partSubpart(pid,sid,quantity)

basicPart(pid,weight)

A tuple (p, s, q) is in partSubPart if part s occurs q times as a direct

subpart of part p. For example, think of a car c that has 4 wheels w and

1 radio r. Then (c, w, 4) and (c, r, 1) would be in partSubpart. Furthermore, then think of a wheel w that has 5 bolts b. Then (w, b, 5) would be

in partSubpart.

A tuple (p, w) is in basicPart if basic part p has weight w. A basic part

is defined as a part that does not have subparts. In other words, the pid

of a basic part does not occur in the pid column of partSubpart.

(In the above example, a bolt and a radio would be basic parts, but car

and wheel would not be basic parts.)

We define the aggregated weight of a part inductively as follows:

(a) If p is a basic part then its aggregated weight is its weight as given

in the basicPart relation

(b) If p is not a basic part, then its aggregated weight is the sum of the

aggregated weights of its subparts, each multiplied by the quantity

with which these subparts occur in the partSubpart relation.

9

Example tables: The following example is based on a desk lamp with

pid 1. Suppose a desk lamp consists of 4 bulbs (with pid 2) and a frame

(with pid 3), and a frame consists of a post (with pid 4) and 2 switches

(with pid 5). Furthermore, we will assume that the weight of a bulb is 5,

that of a post is 50, and that of a switch is 3.

Then the partSubpart and basicPart relation would be as follows:

partSubPart

pid sid quantity

1 2 4

1 3 1

3 4 1

3 5 2

basicPart

pid weight

2 5

4 50

5 3

Then the aggregated weight of a lamp is 4 × 5 + 1 × (1 × 50 + 2 × 3) = 76.

(a) Write a recursive function recursiveAggregatedWeight(p integer)

that returns the aggregated weight of a part p. Test your function.

(b) Write a non-recursive function nonRecursiveAggregatedWeight(p

integer) that returns the aggregated weight of a part p. Test your

function.

10

9. Practice problem–not graded. Consider the heap data structure. For

a description, consult

https://en.wikipedia.org/wiki/Binary heap.

(a) Implement this data structure in PostgreSQL. This implies that you

need to implement the insert and extract heap operations.

In this problem, you are not allowed to use arrays to implement this

data structure! Rather you must you relations.

(b) Then, using the heap data structure developed in question 9a, write

a PostgreSQL program heapSort() that implement the Heapsort

algorithm. For a description of this algorithm, see

https://en.wikipedia.org/wiki/Heapsort

You are not allowed to use arrays to implement this the Heapsort

algorithm!

The input format is a list of integers stored in a binary relation Data(index,value).

For example, Data could contain the following data.

Data

index value

1 3

2 1

3 2

4 0

5 7

The output of heapSort() should be stored in a relation sortedData(index,value).

On the Data relation above, this should be the following relation:

11

10. Practice problem–not graded. Suppose you have a weighted (directed)

graph G stored in a ternary table with schema

Graph(source int, target int, weight int)

A triple (s, t, w) in Graph indicates that Graph has an edge (s, t) whose

edge weight is w. (In this problem, we will assume that each edge weight

is a positive integer.)

Below is an example of a graph G.

Graph G

source target weight

0 1 2

1 0 2

0 4 10

4 0 10

1 3 3

3 1 3

1 4 7

4 1 7

2 3 4

3 2 4

3 4 5

4 3 5

4 2 6

Implement Dijkstra’s Algorithm as a PostgreSQL function Dijkstra(s

integer) to compute the shortest path lengths (i.e., the distances) from

some input vertex s in G to all other vertices in G. Dijkstra(s integer)

should accept an argument s, the source vertex, and outputs a table which

represents the pairs (t, d) where d is the shortest distance from s to t in

graph G. To test your procedure, you can use the graph shown above.

When you apply Dijkstra(0), you should obtain the following table:

target shortestDistance

0 0

1 2

2 9

3 5

4 9

12

11. Consider the relation schema document(doc int, words text[]) representing a relation of pairs (d, W) where d is a unique id denoting a

document and W denotes the set of words that occur in d.

Let W denote the set of all words that occur in the documents and let t

be a positive integer denoting a threshold. Let X ⊆ W. We say that X is

t-frequent if

count({d|(d, W) ∈ document and X ⊆ W}) ≥ t

In other words, X is t-frequent if there are at least t documents that

contain all the words in X.

Write a PostgreSQL program frequentSets(t int) that returns the set

of all t-frequent set.

In a good solution for this problem, you should use the following rule: if X

is not t-frequent then any set Y such that X ⊆ Y is not t-frequent either.

In the literature, this is called the Apriori rule of the frequent itemset

mining problem. This rule can be used as a pruning rule. In other words,

if you have determined that a set X in not t-frequent then you no longer

have to consider any of X’s supersets.

To learn more about this problem you can visit the site

https://en.wikipedia.org/wiki/Apriori algorithm.

Test your function frequentSets for thresholds 0 through 5.

13

12. Consider a directed graph G stored in a relation Graph(source int,

target int). We say that G is Hamiltonian if G has a cycle (n1, . . . nk)

such that each node n in G occurs once, but only once, as a node ni

in

this cycle.

(a) Write a recursive function recursiveHamiltonian() that returns

true if the graph stored in Graph is Hamiltonian, and false otherwise. Test your function.

(b) Write a non-recursive function nonRecursiveHamiltonian that returns true if the graph stored in Graph is Hamiltonian, and false

otherwise. Test your function.

14

4 Key-value Stores (MapReduce and Spark)

Consider the document “MapReduce and the New Software Stack” available

in the module on MapReduce.5

In that document, you can, in Sections 2.3.3-

2.3.7, find descriptions of algorithms to implement relational algebra operations

in MapReduce. (In particular, look at the mapper and reducer functions for

various RA operators.)

Remark 1 Even though MapReduce as a top-level programming language is

only rarely used, it still serves as an underlying programming environment to

which other languages compile. Additionally, the programming techniques of

applying maps to key-value stores and reducing (accumulating, aggregating) intermediate and final results is an important feature of parallel and distributed

data processing. Additionally, the MapReduce framework forces one to reason

about modeling data towards key-value stores. Finally, the fact that the MapReduce programming model can be entirely simulated in the PostgreSQL objectrelational system underscores again the versatility of this system for a broad

range of database programming and application problems.

In the following problems, you are asked to write MapReduce programs that

implement some RA operations and queries with aggregation in PostgreSQL. In

addition, you need to add the code which permits the PostgreSQL simulations

for these MapReduce programs.

Discussion A crucial aspect of solving these problems is to develop an appropriate data representation for the input to these problems. Recall that in

MapReduce the input is a single binary relation of (key, value) pairs.

We will now discuss a general method for representing (encoding) a relational database in a single key-value store. Crucial in this representation is the

utilization of json objects.6

Consider a relation R(a,b,c). For simplicity, we will assume that the domain

of the attributes of R is integer.7

create table R (a int, b int, c int);

insert into R values (1,2,3), (4,5,6), (1,2,4);

table R;

a | b | c

—–+—+—

1 | 2 | 3

4 | 5 | 6

1 | 2 | 4

5This is Chapter 2 in Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman,

and Jeffrey D. Ullman.

6

Incidentally, this modeling technique is independent of MapReduce and can also be used

to map relational data to other systems and programming languages that center around json

objects.

7However, this approach can be generalized for other domains such as string, booleans,

etc.

15

Starting from this relation R we can, using jsonb8

functions and operations on

jsonb objects, come up with an encoding of R as a key-value store. Consider

the tuple

(1, 2, 3)

in R. We will represent (encode) this tuple as the key-value pair

(‘R’,{“a”:1, “b”:2, “c”:3}).

So the key of this pair is the relation name ‘R’ and the jsonb object {“a”:

1, “b”:2, “c”: 1} represents the tuple (1, 2, 3). Based on this idea of representing tuples of R, we can generate the entire key-value store for R using an

object-relational SQL query.9 To that end, we can use the jsonb build object

PostgreSQL function.

create table encodingofR (key text, value jsonb);

insert into encodingofR

select ’R’ as key, jsonb_build_object(’a’, r.a, ’b’, r.b, ’c’, r.c) as value

from R r;

This gives the following encoding for R.

table encodingofR;

key | value

—–+—————————–

R | {“a” : 1, “b” : 2, “c” : 3}

R | {“a” : 4, “b” : 5, “c” : 6}

R | {“a” : 1, “b” : 2, “c” : 4}

Note that we can also “decode” the encodingofR key-value store to recover R

by using the following object-relational SQL query. To that end, we can use the

jsonb selector function ->.

select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c

from encodingofR p;

a | b | c

—–+—+—

1 | 2 | 3

4 | 5 | 6

1 | 2 | 4

An important aspect of this encoding strategy is that it is possible to put

multiple relations, possible with different schemas and arities, into the same

key-value store. Besides R, let us also consider a binary relation S(a,d).

create table S (a int, d int);

insert into S values (1,2), (5,6), (2,1), (2,3);

table S;

8PostgreSQL support both json and jsonb objects. For this assignment, you should use the

jsonb object type since it comes with more functionality and offers more efficient computation.

9Notice that this strategy works in general for any relation, independent of the number of

attributes of the relation.

16

a | d

—–+

1 | 2

5 | 6

2 | 1

2 | 3

(4 rows)

We can now encode both R and S into a single key-value store encodingofRandS

as follows:

create table encodingofRandS(key text, value jsonb);

insert into encodingofRandS

select ’R’ as key, jsonb_build_object(’a’, r.a, ’b’, r.b, ’c’, r.c) as value

from R r

union

select ’S’ as key, jsonb_build_object(’a’, s.a, ’d’, s.d) as value

from S s

order by 1, 2;

table encodingofRandS;

key | value

—–+————————–

R | {“a”: 1, “b”: 2, “c”: 3}

R | {“a”: 1, “b”: 2, “c”: 4}

R | {“a”: 4, “b”: 5, “c”: 6}

S | {“a”: 1, “d”: 2}

S | {“a”: 2, “d”: 1}

S | {“a”: 2, “d”: 3}

S | {“a”: 5, “d”: 6}

(7 rows)

Furthermore, we can decode this key-value store using 2 object-relational SQL

queries and recover R and S.

select p.value->’a’ as a, p.value->’b’ as b, p.value->’c’ as c

from encodingofRandS p

where p.key = ’R’;

a | b | c

—–+—+—

1 | 2 | 3

4 | 5 | 6

1 | 2 | 4

(3 rows)

select p.value->’a’ as a, p.value->’d’ as d

from encodingofRandS p

where p.key = ’S’;

a | d

—–+

1 | 2

5 | 6

2 | 1

2 | 3

(4 rows)

17

Example 2 Consider the following problem. Write, in PostgreSQL, a basic

MapReduce program, i.e., a mapper function and a reducer function, as well as

a 3-phases simulation that implements the set intersection of two unary relations

R(a) and S(a), i.e., the relation R ∩ S. You can assume that the domain of the

attribute ‘a’ is integer.

— EncodingOfRandS;

drop table R; drop table S;

create table R(a int);

insert into R values (1),(2),(3),(4);

create table S(a int);

insert into S values (2),(4),(5);

drop table EncodingOfRandS;

create table EncodingOfRandS(key text, value jsonb);

insert into EncodingOfRandS

select ’R’ as key, jsonb_build_object(’a’, r.a) as value

from R r

union

select ’S’ as key, jsonb_build_object(’a’, s.a) as value

from S s

order by 1;

table EncodingOfRandS;

key | value

—–+———-

R | {“a”: 1}

R | {“a”: 4}

R | {“a”: 2}

R | {“a”: 3}

S | {“a”: 4}

S | {“a”: 5}

S | {“a”: 2}

(7 rows)

— mapper function

CREATE OR REPLACE FUNCTION mapper(key text, value jsonb)

RETURNS TABLE(key jsonb, value text) AS

$$

SELECT value, key;

$$ LANGUAGE SQL;

— reducer function

CREATE OR REPLACE FUNCTION reducer(key jsonb, valuesArray text[])

RETURNS TABLE(key text, value jsonb) AS

$$

SELECT ’R intersect S’::text, key

WHERE ARRAY[’R’,’S’] <@ valuesArray;

$$ LANGUAGE SQL;

— 3-phases simulation of MapReduce Program followed by a decoding step

WITH

Map_Phase AS (

SELECT m.key, m.value

FROM encodingOfRandS, LATERAL(SELECT key, value FROM mapper(key, value)) m

18

),

Group_Phase AS (

SELECT key, array_agg(value) as value

FROM Map_Phase

GROUP BY (key)

),

Reduce_Phase AS (

SELECT r.key, r.value

FROM Group_Phase, LATERAL(SELECT key, value FROM reducer(key, value)) r

)

SELECT p.value->’a’ as a FROM Reduce_Phase p

order by 1;

a

—

2

4

(2 rows)

We now turn to the problems for this section.

13. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well

as a 3-phases simulation that implements the symmetric difference of two

unary relations R(a) and S(a), i.e., the relation (R−S)∪(S−R). You can

assume that the domain of the attribute ‘a’ is integer.

14. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the semijoin of two relations R(A,B) and S(A,B,C), i.e., the relation

R n S. You can assume that the domains of A, B, and C are integer.

Use the encoding and decoding methods described above.

15. Practice problem–not graded. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a

3-phases simulation that implements the natural join R ./ S of two relations R(A, B) and S(B,C). You can assume that the domains of A, B, and

C are integer. Use the encoding and decoding methods described above.

16. Write, in PostgreSQL, a basic MapReduce program, i.e., a mapper function and a reducer function, as well as a 3-phases simulation that implements the SQL query

SELECT r.A, array_agg(r.B), sum(r.B)

FROM R r

GROUP BY (r.A)

HAVING COUNT(r.B) < 3;

Here R is a relation with schema (A, B). You can assume that the domains

of A and B are integers. Use the encoding and decoding methods described

above.

19

We now turn to some problems that relate to query processing in Spark.

Note that in Spark it is possible to operate on multiple key-value stores.

17. Let R(K, V ) and S(K, W) be two binary key-value pair relations. You

can assume that the domains of K, V , and W are integers. Consider the

cogroup transformation R.cogroup(S) introduced in the lecture on Spark.

(a) Define a PostgreSQL view coGroup that computes a complex-object

relation that represent the co-group transformation R.cogroup(S).

Show that this view works.

(b) Write a PostgreSQL query that use this coGroup view to compute

the semi join R n S, in other words compute the relation R ./ πK(S).

(c) Write a PostgreSQL query that uses this coGroup view to implement

the SQL query

SELECT distinct r.K as rK, s.K as sK

FROM R r, S s

WHERE NOT ARRAY(SELECT r1.V

FROM R r1

WHERE r1.K = r.K) && ARRAY(SELECT s1.W

FROM S s1

WHERE s1.K = s.K);

18. Practice problem–not graded. Let A(x) and B(x) be the schemas to

represent two set of integers A and B. Consider the cogroup transformation introduced in the lecture on Spark. Using an approach analogous to

the one in Problem 17 solve the following problems:10

(a) Write a PostgreSQL query that uses the cogroup transformation to

compute A ∩ B.

(b) Write a PostgreSQL query that uses the cogroup operator to compute

the symmetric difference of A and B, i.e., the expression

(A − B) ∪ (B − A).

10An important aspect of this problem is to represent A and B as a key-value stores.

20

5 Graph query languages

Each of the following problems is a practice problem.

19. Practice Problem–not graded. Consider the database schema Person,

Company, companyLocation, Knows, jobSkill, and personSkill.

(a) Specify an Entity-Relationship Diagram that models this database

schema.

(b) Specify the node and relationship types of a Property Graph for

this database schema. In addition, specify the properties, if any,

associated with each such type.

20. Practice Problem–not graded. Using the Property Graph model in

Problem 19b, formulate the following queries in the Cypher query language:

(a) Find the types of the relationships associated with Person nodes.

(b) Find each person (node) whose name is ‘John’ and has a salary that

is at least 50000.

(c) Find each jobSkill (node) that is the job skill of a person who knows

a person who works for ’Amazon’ and who has a salary that is at

least 50000.

(d) Find each person (node) who knows directly or indirectly (i.e., recursively) another person who works for Amazon.

(e) Find for each company node, that node along with the number of

persons who work for that company and who have both the Databases

and Networks job skills.

21

B561

# Assignment 7 Testing Effectiveness of Query Optimization

Original price was: $35.00.$30.00Current price is: $30.00.

## Reviews

There are no reviews yet.