SQL projects can be authored in PostgreSQL, Snowflake, Redshift, or BigQuery. The process is the same regardless of which technology chosen, however, there are some extra steps and different approaches compared to authoring R or Python projects.
Setting up a database
The Content team will set up your database and share connection details with you. You will need to share your data with them to do this.
Sharing your data with the Content team
You need to provide your Content team point of contact with the relevant data files. Specifically, this must include:
File(s) containing the data.
These should be in
.csv
,.xlsx
,.parquet
, or.json
format. If you wish to use a different format, then please discuss it with the Content team.
File(s) to create the tables.
You'll need to provide a
.sql
file with commands to create the table(s) in your chosen technology along with any other required commands, e.g., set stage and specify file format for Snowflake.This should detail the table name(s), column names, data types, and the relationship between tables (if appropriate).
Below is an example of a PostgreSQL database containing four tables created from four
.csv
files:
DROP TABLE orders;
CREATE TABLE orders (
order_id NUMBER PRIMARY KEY,
order_date DATE,
order_time TIME
);
DROP TABLE pizza_type;
CREATE TABLE pizza_type (
pizza_type_id VARCHAR(50) PRIMARY KEY,
name VARCHAR(100),
category VARCHAR(50),
ingredients VARCHAR(500)
);
DROP TABLE pizzas;
CREATE TABLE pizzas (
pizza_id VARCHAR(50) PRIMARY KEY,
pizza_type_id VARCHAR(50) FOREIGN KEY REFERENCES pizza_types(pizza_type_id),
size VARCHAR(1),
price FLOAT
);
DROP TABLE order_details;
CREATE TABLE order_details (
order_details_id NUMBER PRIMARY KEY,
order_id NUMBER FOREIGN KEY REFERENCES orders(order_id),
pizza_id VARCHAR(50) FOREIGN KEY REFERENCES pizzas(pizza_id),
quantity NUMBER
);
\copy order_details FROM 'order_details.csv' DELIMITER ',' CSV HEADER;
\copy orders FROM 'orders.csv' DELIMITER ',' CSV HEADER;
\copy pizza_types FROM 'pizza_types.csv' DELIMITER ',' CSV HEADER;
\copy pizzas FROM 'pizzas.csv' DELIMITER ',' CSV HEADER;
Structuring the Sample Code
Sample Code in SQL is different from Sample Code in R or Python. Specifically, you should create one SQL code cell per query required for the project, and assign the output variable to a name so that it can be tested in the SCTs. In the example below from this project, we can see three Snowflake SQL cells with the output names most_popular_transport_types
, emirates_airline_popularity
, and least_popular_years_tube
(set at the top of the cell, i.e., "DataFrame available as <variable_name>").:
Instructions
As with any project, the Instructions should not detail how to solve the project, but should provide sufficient detail about any requirements of the output. In the case of this project, the first output must:
1) Be named as most_popular_transport_types
(this has been provided for the learner, and should not be edited).
2) Contain two columns called JOURNEY_TYPE
and TOTAL_JOURNEYS_MILLIONS
, and
3) Be sorted by TOTAL_JOURNEYS_MILLIONS
in descending order.
We provide the following instructions for learners to interpret in order to complete this task:
Additionally, we include text and a screenshot clarifying how to access the database and which cell should be used for each task (in the example project we have three tasks, so there are three SQL code cells):
Guides
Guides for SQL projects should follow a structure of one step per set of Instructions. For example, in this project, there are three tasks and associated SQL code cells, so there are three steps:
1) Finding the most popular transport types.
2) Identifying the most popular months and years for Emirates Airline cable car travel.
3) Least popular years for the tube.
Solution Code
This should be a single SQL script where each query is separated by a semi-colon ;
. It should contain a code comment with the output variable and another clarifying which cell to copy the code into. For example, in this project, we include the following code for solving the first task:
Submission Correctness Tests (SCTs)
Outputs of SQL queries are stored as pandas DataFrames. Therefore, SCTs for SQL projects are written in Python. For example, here is the output of the first query from Exploring London's Travel Network, which is saved as a pandas DataFrame called most_popular_transport_types
:
We can perform tests on this output. such as checking that a DataFrame most_popular_transport_types
exists, the data type, the number of columns and rows, the column names, and the values, using the Python code below. Note that sct.py
will exit when the first assert statement fails, so the order of the tests is important.
import pandas as pd
# most_popular_transport_types
assert most_popular_transport_types is not None, "Did you store the results of your query as a pandas DataFrame called most_popular_transport_types? Please use the SQL cell provided to complete the query."
assert isinstance(most_popular_transport_types, pd.DataFrame), "Did you create a query with the results saved as a pandas DataFrame called most_popular_transport_types?"
assert most_popular_transport_types.shape == (6,2), \
"Expected most_popular_transport_types to contain six rows and two columns. Please recheck your query."
assert set(most_popular_transport_types.columns.str.lower()) == set(["journey_type", "total_journeys_millions"]), \
"Expected different column names for most_popular_transport_types. Did you remember to alias the second column?"
assert set(most_popular_transport_types["JOURNEY_TYPE"].values) == set(['Bus', 'Emirates Airline', 'Overground', 'TfL Rail', 'Tram', 'Underground & DLR']), \
"most_popular_transport_types should contain six different values for the journey_types columns. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "Bus"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 24905.19, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for Bus journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "Underground & DLR"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 15020.47, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for Underground & DLR journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "Overground"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 1666.85, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for Overground journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "TfL Rail"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 411.31, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for TfL Rail journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "Tram"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 314.69, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for Tram journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["JOURNEY_TYPE"] == "Emirates Airline"]["TOTAL_JOURNEYS_MILLIONS"].values.round(2) == 14.58, \
"Expected a different number of TOTAL_JOURNEYS_MILLIONS for Emirates Airline journeys. Please recheck your query."
assert most_popular_transport_types[most_popular_transport_types["TOTAL_JOURNEYS_MILLIONS"] == most_popular_transport_types["TOTAL_JOURNEYS_MILLIONS"].max()].index.values == 0, \
"Did you sort the results of most_popular_transport_types by TOTAL_JOURNEYS_MILLIONS in descending order?"
assert most_popular_transport_types[most_popular_transport_types["TOTAL_JOURNEYS_MILLIONS"] == most_popular_transport_types["TOTAL_JOURNEYS_MILLIONS"].min()].index.values == 5, \
"Did you sort the results of most_popular_transport_types by TOTAL_JOURNEYS_MILLIONS in descending order?"
For more information about authoring SCTs please see this article.
Description
There are no changes to this section compared to authoring R or Python projects. See this article for more information.
Resources
There are no changes to this section compared to authoring R or Python projects. See this article for more information.