The Solution Code provides the end-to-end code used to solve the problem. It should conform to coding best practices and use methods/functions/techniques that learners will be familiar with if they have completed the prerequisite course(s). It should also include code comments throughout so that learners can see why this specific approach has been taken.

R/Python

For R/Python projects the Solution Code will be a single .R or .py file that learners can copy and paste into the Workbook. Here is an example Solution Code for the Analyzing Crime in Los Angeles Python project:

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read in and preview the dataset
crimes = pd.read_csv("crimes.csv", parse_dates=["Date Rptd", "DATE OCC"], dtype={"TIME OCC": str})
crimes.head()

## Which hour has the highest frequency of crimes? Store as an integer variable called peak_crime_hour

# Extract the first two digits from "TIME OCC", representing the hour,
# and convert to integer data type
crimes["HOUR OCC"] = crimes["TIME OCC"].str[:2].astype(int)

# Preview the DataFrame to confirm the new column is correct
crimes.head()

# Produce a countplot to find the largest frequency of crimes by hour
sns.countplot(data=crimes, x="HOUR OCC")
plt.show()

# Midday has the largest volume of crime
peak_crime_hour = 12

## Which area has the largest frequency of night crimes (crimes committed between 10pm and 4am)? 
## Save as a string variable called peak_night_crime_location
# Filter for the night-time hours
night_time = crimes[crimes["HOUR OCC"].isin([10,11,12,1,2,3,4])]

# Group by "AREA NAME" and count occurrences, filtering for the largest value and saving the "AREA NAME"
peak_night_crime_location = night_time.groupby("AREA NAME", 
                                               as_index=False)["HOUR OCC"].count().sort_values("HOUR OCC", ascending=False).iloc[0]["AREA NAME"]                                                                                     

# Print the peak night crime location
print(peak_night_crime_location)

## Identify the number of crimes committed against victims by age group (<18, 18-25, 26-34, 35-44, 45-54, 55-64, 65+) 
## Save as a pandas Series called victim_ages
# Create bins and labels for victim age ranges
age_bins = [0, 17, 25, 34, 44, 54, 64, np.inf]
age_labels = ["<18", "18-25", "26-34", "35-44", "45-54", "55-64", "65+"]

# Add a new column using pd.cut() to bin values into discrete intervals
crimes["Age Bracket"] = pd.cut(crimes["Vict Age"],
                               bins=age_bins,
                               labels=age_labels)

# Find the category with the largest frequency
victim_ages = crimes["Age Bracket"].value_counts()
print(victim_ages)

SQL

For PostgreSQL, Snowflake, Redshift, or BigQuery projects the Solution Code will be a single .SQL file. As SQL projects generally contain more than one SQL code cell, each query within the Solution Code must be separated by a semi-colon ;. Here is example of Solution Code for the Exploring London's Travel Network Snowflake project:

-- most_popular_transport_types
-- Copy code into first SQL cell
SELECT JOURNEY_TYPE, 
	SUM(JOURNEYS_MILLIONS) as TOTAL_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
GROUP BY JOURNEY_TYPE
ORDER BY TOTAL_JOURNEYS_MILLIONS DESC;

-- emirates_airline_popularity
-- Copy code into second SQL cell
SELECT MONTH, 
	YEAR, 
	ROUND(JOURNEYS_MILLIONS,2) AS ROUNDED_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
WHERE JOURNEY_TYPE = 'Emirates Airline' AND ROUNDED_JOURNEYS_MILLIONS IS NOT NULL
ORDER BY ROUNDED_JOURNEYS_MILLIONS DESC
LIMIT 5;

-- least_popular_years_tube
-- Copy code into third SQL cell
SELECT YEAR,
	JOURNEY_TYPE,
	SUM(JOURNEYS_MILLIONS) as TOTAL_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
WHERE JOURNEY_TYPE LIKE '%Underground%'
GROUP BY YEAR, JOURNEY_TYPE
ORDER BY TOTAL_JOURNEYS_MILLIONS
LIMIT 5;

For more information about developing SQL projects please see this article.

How to author Solution Code for DataLab projects

R/Python

SQL