Lucas Seidenfaden – Thoughts

Analysing Discogs Part 1: AWS

The Discogs dataset is pretty big, so the first thing we need is a powerful server to download and transform the zipped XML files we’re getting from Discogs. To get started you’ll need to install the AWS command line interface and create a new user with permissions to create and manage EC2 and RDS instances. I’ve placed the authentication keys for my user in a new profile called discogs under ~/.aws/credentials. You’ll see this referenced in all aws commands using the --profile discogs argument.
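If you want to confirm the profile is in place before running anything, a quick check along these lines may help. This is only a sketch: has_profile is a helper name I made up for illustration, and the example runs against a throwaway file with placeholder keys rather than your real credentials file.

```shell
# Hypothetical helper: succeeds if the named profile has a section
# in the credentials file (defaults to ~/.aws/credentials).
# Usage: has_profile NAME [FILE]
has_profile() {
    grep -qxF "[$1]" "${2:-$HOME/.aws/credentials}"
}

# Example against a throwaway file holding placeholder (not real) keys.
creds=$(mktemp)
cat > "$creds" <<'EOF'
[default]
aws_access_key_id = PLACEHOLDER
aws_secret_access_key = PLACEHOLDER

[discogs]
aws_access_key_id = PLACEHOLDER
aws_secret_access_key = PLACEHOLDER
EOF

has_profile discogs "$creds" && echo "discogs profile found"
```

Calling has_profile discogs with no second argument checks your actual ~/.aws/credentials.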

The goal of this tutorial is to get and prepare the Discogs data on AWS completely from the command line.

The Discogs dataset is hosted in the us-west-2 region, so we’ll use the same region and create an EC2 instance in it.

Before getting started you will need to add a new keypair called discogs-keypair to your AWS account in the us-west-2 region (see the P.S. at the end if you don’t have one yet).

We can then run the command that tells AWS to create and start a new instance.

aws ec2 run-instances \
--image-id ami-4e79ed36 \
--count 1 \
--instance-type t2.xlarge \
--key-name discogs-keypair \
--block-device-mappings '{"DeviceName": "/dev/sda1","Ebs": {"VolumeSize": 50 }}' \
--region us-west-2 \
--profile discogs \
> instance.json

We’re saving the output of this command to instance.json so we can pull the instance ID out of it later and make a follow-up query for the public IP once the server is ready.

Store the instance ID in a variable so we can use it later on.

INSTANCE_ID=$(cat instance.json | grep "InstanceId" | grep -oE "\b(i-[0-9a-z]+)\b")
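To see what that grep pipeline is doing, here is a small worked example. The JSON below is a made-up, heavily trimmed stand-in for the real run-instances output, and extract_instance_id is just an illustrative name for the same two greps wrapped in a function.

```shell
# Hypothetical, trimmed-down sample of what run-instances writes out.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{
    "Instances": [
        {
            "InstanceId": "i-0123456789abcdef0",
            "InstanceType": "t2.xlarge"
        }
    ]
}
EOF

# Take the first line mentioning InstanceId, then keep only the id itself.
extract_instance_id() {
    grep -m1 '"InstanceId"' "$1" | grep -oE 'i-[0-9a-f]+'
}

extract_instance_id "$sample"
```

The first grep narrows the file down to the one line we care about, and the second uses -o to print only the part that looks like an instance ID.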

Instance initialisation can take a while, so we will wait for a bit.

sleep 30
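A fixed sleep is a guess, and 30 seconds may be too short or needlessly long. As an alternative sketch, the AWS CLI has a built-in waiter that polls until the instance reports the running state. The function name wait_until_running is my own; the region and profile are the ones used throughout this post.

```shell
# Block until EC2 reports the instance as running.
# Usage: wait_until_running INSTANCE_ID
wait_until_running() {
    aws ec2 wait instance-running \
        --instance-ids "$1" \
        --region us-west-2 \
        --profile discogs
}
```

You would call it as wait_until_running "$INSTANCE_ID" in place of the sleep. The waiter exits with an error if the instance never reaches the running state within its timeout, so a script can stop early instead of carrying on with a half-provisioned server.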

Now AWS should be ready with a freshly provisioned EC2 server, so we’ll query it for its details, overwriting the original file with the updated data.

Get updated data from AWS

aws ec2 describe-instances \
--instance-ids $INSTANCE_ID \
--region us-west-2 \
--profile discogs \
> instance.json

And store the IP address and VPC ID in variables

IP_ADDRESS=$(cat instance.json | grep "PublicIpAddress" | grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b")
VPC_ID=$(cat instance.json | grep -m1 "VpcId" | grep -oE "\b(vpc-[0-9a-z]+)\b")
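Grepping JSON works here but is a little fragile. One possible alternative, sketched below, is to let the CLI pick out the field itself with its --query and --output text options. The helper name instance_field is something I made up, and the query path assumes a single reservation containing a single instance, which is what we get when asking for one instance ID.

```shell
# Print one field of an instance as plain text.
# Usage: instance_field INSTANCE_ID FIELD   (e.g. PublicIpAddress, VpcId)
instance_field() {
    aws ec2 describe-instances \
        --instance-ids "$1" \
        --query "Reservations[0].Instances[0].$2" \
        --output text \
        --region us-west-2 \
        --profile discogs
}
```

With that in place the two variables could be set with IP_ADDRESS=$(instance_field "$INSTANCE_ID" PublicIpAddress) and VPC_ID=$(instance_field "$INSTANCE_ID" VpcId), with no intermediate file needed.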

By default EC2 instances are not open to incoming connections so we’ll have to open up port 22 so we can ssh in. In order to do this we will create a new security group and add it to the instance.

aws ec2 create-security-group \
--group-name discogs-sg \
--description "Discogs SG" \
--vpc-id $VPC_ID \
--region us-west-2 \
--profile discogs \
> sg.json

Store the security group id

SG_ID=$(cat sg.json | grep -oE "\b(sg-[0-9a-z]+)\b")

Open up the port

aws ec2 authorize-security-group-ingress \
--group-id $SG_ID \
--protocol tcp \
--port 22 \
--cidr 0.0.0.0/0 \
--region us-west-2 \
--profile discogs
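Creating the group and opening the port does not by itself put the instance in the group; the new security group still has to be attached. A minimal sketch of that step, assuming the INSTANCE_ID and SG_ID variables from earlier (the function name attach_security_group is mine):

```shell
# Attach a security group to an instance.
# Note: --groups REPLACES the instance's current list of security groups.
# Usage: attach_security_group INSTANCE_ID SG_ID
attach_security_group() {
    aws ec2 modify-instance-attribute \
        --instance-id "$1" \
        --groups "$2" \
        --region us-west-2 \
        --profile discogs
}
```

Call it as attach_security_group "$INSTANCE_ID" "$SG_ID". Because --groups replaces the existing list, pass any other group IDs you want to keep as additional values.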

This wraps up the initial setup of our EC2 instance, which we need to download and transform the data before we can upload it to RDS. I’ve put this whole script into one file called init.sh.

The EC2 instance is now ready for us to ssh into.

ssh "ubuntu@$IP_ADDRESS"

Or if you followed the instructions below to create your key

ssh -i ~/.ssh/discogs-tut "ubuntu@$IP_ADDRESS"

Stay tuned for part 2 where we find out how to transform the Discogs dataset into something we can load into our database.

P.S. If you can’t connect to your instance after creating it, here are the steps needed to create a keypair and add it to AWS.

1. Create the key locally

ssh-keygen -t rsa -C "discogs-tut" -f ~/.ssh/discogs-tut

2. Add it to EC2

aws ec2 import-key-pair --key-name "discogs-keypair" \
--public-key-material file://~/.ssh/discogs-tut.pub \
--region us-west-2 \
--profile discogs