Project author: peterjprudhomme

Project description:
Project for Udacity Data Engineering Nanodegree. Designing a relational database schema and building an ETL pipeline.
Language: Python
Repository: git://github.com/peterjprudhomme/Sparkify-ETL-Pipeline.git
Created: 2021-01-12T16:44:43Z
Project page: https://github.com/peterjprudhomme/Sparkify-ETL-Pipeline

License:


Sparkify ETL Pipeline

This project was created as part of the Udacity Data Engineering Nanodegree program. Sparkify is a fictional music-streaming platform. The goal is to take JSON files of song data (from the Million Song Dataset) and user-log data (generated by an event simulator) and load them into a PostgreSQL database whose schema is designed to optimize queries on user song-choice patterns.

This repository contains the files to create a relational database and export the accompanying song and user log files into it. Once the files are in this format, querying the song/log data for analysis becomes much easier. A few sample queries demonstrating this are shown at the end of this document.

The Database

The database, named sparkifydb, is a relational database with five tables organized in a star schema. songplays is the fact table; the other four are dimension tables. The specific contents of each table are defined in the sql_queries.py file.
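As a rough illustration of the star-schema idea, here is a self-contained sketch using SQLite (the project itself targets PostgreSQL, and the authoritative table definitions live in sql_queries.py; the column names below are assumptions for illustration only):

```python
import sqlite3

# In-memory database for illustration; the real DDL is in sql_queries.py.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fact table: one row per song play (assumed columns).
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        user_id     INTEGER,
        song_id     TEXT,
        artist_id   TEXT,
        start_time  TEXT
    )
""")

# One of the four dimension tables (assumed columns).
cur.execute("""
    CREATE TABLE songs (
        song_id   TEXT PRIMARY KEY,
        title     TEXT,
        artist_id TEXT
    )
""")

cur.execute("INSERT INTO songs VALUES ('S1', 'Example Song', 'A1')")
cur.execute("INSERT INTO songplays VALUES (1, 42, 'S1', 'A1', '2021-01-12')")

# A typical star-schema query: the fact table joined to a dimension table.
cur.execute("""
    SELECT songs.title, COUNT(songplays.songplay_id)
    FROM songplays
    JOIN songs ON songplays.song_id = songs.song_id
    GROUP BY songs.title
""")
print(cur.fetchall())  # → [('Example Song', 1)]
conn.close()
```

The point of the star layout is that analytical queries touch the compact fact table and join out to dimensions only when descriptive attributes (like a song title) are needed.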

The Data

The raw data is stored within the data folder in two subfolders: log_data and song_data. Both contain JSON files which are transformed through the ETL pipeline into the database. log_data contains records of users' actions on the Sparkify platform, while song_data contains data on specific songs.
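To give a feel for what the pipeline parses, here is a minimal sketch of reading one event record. The field names ("userId", "page", "song") are assumptions based on the typical Sparkify event format; check the actual files for the full set of fields:

```python
import json

# One line from a log_data file (hypothetical record for illustration).
line = '{"userId": "42", "page": "NextSong", "song": "Example Song", "ts": 1610469883000}'

event = json.loads(line)

# Only "NextSong" events correspond to actual song plays.
if event["page"] == "NextSong":
    print(event["userId"], event["song"])  # → 42 Example Song
```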

The Scripts

The entire ETL process is separated into three scripts:

  • sql_queries.py contains all the necessary SQL queries (as strings) for creating the tables, dropping the tables, and inserting data into them. It also includes a query that matches songs between log_data and song_data to ensure consistent song and artist IDs.

  • create_tables.py first drops the tables if they exist and then creates them according to the layout defined in sql_queries.py.

  • etl.py is the ETL pipeline that loads the information from the JSON files into the relational database.
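The file-gathering step of the pipeline can be sketched as follows. This is a simplified sketch, not the exact code: the real etl.py also opens a database connection and calls per-file processing functions to insert rows:

```python
import glob
import json
import os
import tempfile

def get_files(filepath):
    """Recursively collect all .json files under filepath — the same idea
    etl.py uses to walk data/song_data and data/log_data."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    return all_files

# Demonstrate on a throwaway directory containing one dummy song file.
with tempfile.TemporaryDirectory() as tmp:
    sub = os.path.join(tmp, "A", "B")
    os.makedirs(sub)
    with open(os.path.join(sub, "song.json"), "w") as f:
        json.dump({"song_id": "S1"}, f)
    files = get_files(tmp)
    print(len(files))  # → 1
```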

Running the files

create_tables.py must be run first to create the relevant tables; then run etl.py. Messages printed to the command line will confirm that all files from log_data and song_data have been processed.

Sample Queries

Number of songplays of each song:

  SELECT song_id, COUNT(songplay_id)
  FROM songplays
  GROUP BY song_id;

Number of songplays of each artist:

  SELECT artist_id, COUNT(songplay_id)
  FROM songplays
  GROUP BY artist_id;

Songs listened to by userA:

  SELECT DISTINCT songs.song_id, songs.title
  FROM songplays
  JOIN songs ON songplays.song_id = songs.song_id
  WHERE songplays.user_id = userA;