Project for Udacity Data Engineering Nanodegree. Designing a relational database schema and building an ETL pipeline.
This project was created as part of the Udacity Nanodegree Data Engineering program. Sparkify is a fictional music platform. The goal is to take JSON files of song data (from the Million Songs Dataset) and user-log data (generated by this event simulator) and load them into a postreSQL database whose schema has been designed to optimize for queries into user song-choice patterns.
This repository contains the files to create a relational database and export the accompanying song and user log files into it. Once the files are in this format, querying the song/log data for analysis becomes much easier. A few sample queries demonstrating this are shown at the end of this document.
The database created is named sparkifydb and is a relational database with 5 tables organized into a star schema. Table songplays is the fact table while the other 4 are dimension tables. You can see the specific contents of each table in the sql_queries.py file.
The raw data is stored within the data folder into two subfolders: log_data and song_data. Both contain JSON files which are transformed through the ETL pipeline into the database. log_data contains data from user’s actions on the Sparkify platform, while song_data contains data on specific songs.
The entire ETL process is seperated into three scripts:
sql_queries.py contains all the necessary sql queries (as strings) for creating the tables, droping the tables, and inserting data into the tables. There is also a query for matching songs from log_data and the song_data to ensure consistent song and artist ids.
create_tables.py first drops tables if they exist and then creates them accourding to the layout found in sql_queries.py.
etl.py is the ETL pipeline that loads the information from the JSON files into the relational database.
create_tables.py must be run first to create the relevent tables, then run etl.py. There will be printed messages displayed in the command line to confirm all files from log_data and song_data have been processed.
Number of songplays of each song:
SELECT song_id, COUNT(songplay_id)
FROM songplays
GROUP BY song_id
Number of songplays of each artist:
SELECT atist_id, COUNT(songplay_id)
FROM songplays
GROUP BY artist_id
Songs listened to by userA:
SELECT DISTINCT songs.song_id, songs.title
FROM songplays
JOIN songs ON songplays.song_id = songs.song_id
WHERE songplays.user_id = userA