Sparkify ETL Pipeline

This project was created as part of the Udacity Nanodegree Data Engineering program. Sparkify is a fictional music platform. The goal is to take JSON files of song data (from the Million Songs Dataset) and user-log data (generated by this event simulator) and load them into a postreSQL database whose schema has been designed to optimize for queries into user song-choice patterns.

This repository contains the files to create a relational database and export the accompanying song and user log files into it. Once the files are in this format, querying the song/log data for analysis becomes much easier. A few sample queries demonstrating this are shown at the end of this document.

The Database

The database created is named sparkifydb and is a relational database with 5 tables organized into a star schema. Table songplays is the fact table while the other 4 are dimension tables. You can see the specific contents of each table in the sql_queries.py file.

The Data

The raw data is stored within the data folder into two subfolders: log_data and song_data. Both contain JSON files which are transformed through the ETL pipeline into the database. log_data contains data from user’s actions on the Sparkify platform, while song_data contains data on specific songs.

The Scripts

The entire ETL process is seperated into three scripts:

sql_queries.py contains all the necessary sql queries (as strings) for creating the tables, droping the tables, and inserting data into the tables. There is also a query for matching songs from log_data and the song_data to ensure consistent song and artist ids.
create_tables.py first drops tables if they exist and then creates them accourding to the layout found in sql_queries.py.
etl.py is the ETL pipeline that loads the information from the JSON files into the relational database.

Running the files

create_tables.py must be run first to create the relevent tables, then run etl.py. There will be printed messages displayed in the command line to confirm all files from log_data and song_data have been processed.

Sample Queries

Number of songplays of each song:

SELECT song_id, COUNT(songplay_id) 
    FROM songplays 
    GROUP BY song_id

Number of songplays of each artist:

SELECT atist_id, COUNT(songplay_id) 
    FROM songplays 
    GROUP BY artist_id

Songs listened to by userA:

SELECT DISTINCT songs.song_id, songs.title 
    FROM songplays 
    JOIN songs ON songplays.song_id = songs.song_id
WHERE songplays.user_id = userA