So you’ve decided you need a digital repository, and you’ve chosen DSpace. Here’s how to host it in Amazon’s cloud.
DSpace is a reasonable choice, and it happens to be the least finicky and most modern out of its handful of competitors. But where to host it? Because it requires a Tomcat or other Java servlet environment, you can’t just toss it up in your average commodity hosting service. The next best bet is to get a cheap virtual machine, and Amazon is the place I chose. Here’s what I did.
Note: You could probably do this with equal success on Rackspace. If you do, your process will be different, of course.
The steps for getting DSpace installed and running on an Amazon EC2 instance involve:
1. Choose an EC2 instance type (and launch it, of course)
2. Configure your EBS storage
3. Configure the AWS security and Elastic IP
4. Set up an SSH connection
5. Install the DSpace prerequisites
6. Tweak some environment bits in preparation for DSpace
7. Download, configure, and build DSpace
8. Copy the desired webapps and start Tomcat
What isn’t covered here
I am not going to cover any kind of customization of DSpace. It’s complicated and well beyond the scope of a write-up on hosting. If you want your DSpace to do something that’s not completely obvious from the comments in the configuration files, you will have to do some research on your own. I am also not going to cover things like load balancing or disaster recovery, mostly because I haven’t done any of this myself yet. And finally, I am not going to do any hand-holding with regard to the AWS management console. I am assuming you know how to launch a new instance, for example. These tasks are fairly intuitive if you’ve been around systems administration at all, I think. I will, though, describe the steps that pertain directly to the choices you have to make to get AWS configured for DSpace.
A note about cost
How much will it cost? That really depends on which choices you make above, but in general you only pay for what you use. Let’s just say if you’re looking to get by on a REALLY tight budget, you can (at the time of this writing; Amazon changes prices periodically) have a basic repository up and running for $16-$40 per month, depending on precisely how much storage space you’re using. But a non-enterprise class decently-performing system with, say, up to 1 TB of storage will run you on the order of $200ish a month, assuming you have used the entire terabyte of space. In my estimation, this is a pretty good price to showcase the benefits of a digital repository, after which you might want to spring for something a bit more robust. Aside from a few additional scaling considerations, though, I don’t think there is any significant difference between what I did here and what you might have to do in an enterprise setting.
Step 1: Choosing an Instance Type
If all you need to do is get something basic up and running, with no real care for performance and stability, a T1 Micro instance is probably sufficient. I have run one of these now for a number of months with few issues, but see below.
If you do choose a Micro instance, though, keep in mind you have pretty limited memory. If you don’t mind having your DSpace instance encounter occasional memory problems, terminate unexpectedly, or sometimes just perform badly, this option is for you. Do not take from this that it’s impossible or impractical to run DSpace this way. I just wouldn’t run it this way for anything you intend to use for production. As a development environment, this is perfectly acceptable.
Now, when you first choose to launch a new instance, Amazon will ask you which machine image (AMI) you want to use. Basically this is where you choose things like machine architecture (32 or 64 bit) and operating system. Unless you are experienced with a particular flavor of Linux and prefer it above all others, I recommend the Amazon Linux AMI (64 bit). It’s at the top of the list. It is a Red Hat-style Linux OS, which means you’ll be using yum for package management.
Move through the options, or accept the defaults, which appear to be sane for most purposes. During the instance setup, you have the option to configure Elastic Block Stores (EBS), which is extra disk storage for use with the instance. You may set this up as part of the instance creation, or you may do it later. If you do it later, you will just need to make sure that the EBS gets attached to the instance you created. I trust you can figure out where your volumes are listed.
The last thing you’ll do before launching the instance is to create and download your key file. This is important, because it’s the way you connect to your new VM. Save the file off somewhere where you can remember it. You will need it for Step 4.
Step 2: Configure EBS Storage
When I launched my first EC2 instance for DSpace, I didn’t configure a separate EBS for use with it, and I accepted the default 8 GB root disk size. For a number of months, this was not a problem. It became a problem when my team and I began testing import procedures on larger volumes of content, and I ran out of disk space. Resizing the root disk for an EC2 instance is possible, but it is mildly painful and requires significant downtime. Instead, I recommend, even for development machines, that you place the DSpace assetstore on its own EBS separate from the root disk. This will ensure that you can resize with less disruption, especially if you’ve also allocated an Elastic IP (see Step 3 below).
Create your EBS with any size you like, but keep in mind that EBS bills for the space you allocate, not the space you actually fill. So if you choose to make the EBS 1 TB (the max), you will incur the maximum charge (currently $0.10 per GB per month, or about $100 per month in this case) whether or not you use that much space. Allocate what makes sense to you. I have mine set at 100 GB, which gives me some room, but not too much. After all, I want a cap on the cost of these services, and limiting the disk space is one way to do this.
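For a feel for the numbers, here is a quick sketch of the arithmetic using the $0.10 per GB per month rate quoted above. The function name ebs_monthly_cost is made up for this example, and the rate is an assumption that will go stale whenever Amazon changes its pricing.

```shell
#!/bin/sh
# Rough monthly EBS storage cost: billed gigabytes times the per-GB rate.
# The rate defaults to the $0.10 per GB per month figure quoted in this post;
# check current pricing before relying on it.
ebs_monthly_cost() {
    gigabytes="$1"
    rate="${2:-0.10}"
    awk -v gb="$gigabytes" -v rate="$rate" 'BEGIN { printf "%.2f\n", gb * rate }'
}

ebs_monthly_cost 100     # my 100 GB volume: prints 10.00
ebs_monthly_cost 1000    # roughly a terabyte: prints 100.00
```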
You’ll need to designate the device under which your instance can use the EBS. Typically these look like /dev/sdb or something like that. Pick something and remember what you picked.
Step 3: Security and Elastic IP
You need some security groups. These can be shared across all of your instances. I am using quick-start-1 to hold my custom security rules. This allows me to set network access policies so I can run and publicly access particular services. In particular, you will want SSH access so you can manage the instance from within the OS, and to install and configure things. I have additional ports open to support FTP, but you may not need or desire those. And finally, I have ports 80 and 8080 open for HTTP traffic. DSpace, because it’s a Java web application, will be accessible through port 8080.
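If you would rather script these rules than click through the console, something like the following should work. This is only a sketch: it assumes the AWS command line tools are installed and configured with your credentials, and the function name open_tcp_ports is my own invention. The group name quick-start-1 is the one from my setup, and 203.0.113.25 in the usage note is a placeholder for your own address.

```shell
#!/bin/sh
# Open one or more TCP ports on a security group for a given address range.
# Assumes the AWS CLI ("aws") is installed and already configured.
open_tcp_ports() {
    group="$1"    # security group name, e.g. quick-start-1
    cidr="$2"     # who may connect, e.g. 0.0.0.0/0 for everyone
    shift 2
    for port in "$@"; do
        aws ec2 authorize-security-group-ingress \
            --group-name "$group" --protocol tcp --port "$port" --cidr "$cidr" \
            || return 1
    done
}
```

For example, open_tcp_ports quick-start-1 0.0.0.0/0 80 8080 opens the web ports to everyone, while open_tcp_ports quick-start-1 203.0.113.25/32 22 limits SSH to a single address.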
Next up is Elastic IP. Each EC2 instance is supposed to include an Elastic IP allocation for free. This is useful because, if you have to take your instance down for maintenance (e.g., disk resizing), when it comes back online the underlying instance address might change. The public-facing Elastic IP will not. So you could, for instance, use a link shortener with a custom name (something TinyURL allows) to point to the public IP address without worrying about the shortened URL becoming inoperable. It’s a consideration, anyway.
Allocate and associate your Elastic IP. There’s not much else to configure.
Step 4: Set up SSH
The Linux image I recommended earlier has an SSH server installed by default. You will connect to it through a single account, ec2-user, but to configure your connection, you will have to tell your SSH program where to find the authentication key. On Windows, you are probably using PuTTY for your SSH needs. I don’t know how to configure the key for use in other systems, so you’ll have to search around to figure that out if you’re using something else. Presumably, if you’re using a Mac or another Linux machine, you’ll just specify the key on the command line or something.
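For what it’s worth, the stock OpenSSH client on a Mac or Linux machine can use the .pem file directly, with no conversion. A minimal sketch, assuming OpenSSH; the key path, the address 203.0.113.10, and the helper name ssh_config_stanza are placeholders of mine:

```shell
#!/bin/sh
# One-off connection (key path and address are placeholders):
#   chmod 400 ~/keys/my-aws-key.pem      # ssh refuses keys that others can read
#   ssh -i ~/keys/my-aws-key.pem ec2-user@203.0.113.10

# Print an ~/.ssh/config entry so that a short alias picks up the key.
ssh_config_stanza() {
    alias_name="$1"
    host="$2"
    key_file="$3"
    cat <<EOF
Host $alias_name
    HostName $host
    User ec2-user
    IdentityFile $key_file
EOF
}
```

Append the output to ~/.ssh/config, for example with ssh_config_stanza aws 203.0.113.10 ~/keys/my-aws-key.pem >> ~/.ssh/config, and from then on ssh aws is enough.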
Anyway, with PuTTY, you need the related PuTTYGen program, available from the same site where you download PuTTY. Go get it, run it, and then choose Load to load an existing key file. Browse to where you saved the .pem file, show All files (*.*) and select the .pem file you had previously downloaded. Open it, then Save private key. Put it in the same location, or somewhere where you can find it again.
Now open PuTTY and in the Session pane, enter your AWS instance address, or the Elastic IP address you associated, in the Host Name field. Expand the SSH menu and choose Auth. Browse to and select the .ppk key file you saved with PuTTYGen. Go back to Session, enter a name for this session (aws, perhaps), and click Save. The new session should appear in the list. Double-click your session to open the connection, and at the login prompt, enter ec2-user. If all was configured correctly, you will have successfully authenticated. Now you can finish the other preparations for the system to receive DSpace.
Step 5: Install DSpace Prerequisites
Most of this part is, or should be, straightforward and is mostly documented at the DSpace site. In general, you will need to make sure that your environment has the following components to build DSpace:
Java 6+ (Java 6 JDK is included with the Amazon AMI we are using, but you can also install Java 7 with no issues. I will go through the Java 7 install below)
PostgreSQL 8.4+ (If you use Oracle, you’re on your own; I will be using pg 9x)
Tomcat 5.5+ (we’ll use Tomcat 7, but Tomcat 6 seems to be better on low-end machines; use either)
Apache Maven (we’ll use 3x, no problems)
I am assuming you are running this as root or a privileged user. To become root:
sudo su -
Java 7 JDK
yum install java-1.7.0-openjdk
Once you do that, though, you need to change your default java binary. These are listed in /etc/alternatives and are symlinked. I have changed my symlink to the following (the version part of the directory name will vary, so check /usr/lib/jvm for the exact name on your system).
lrwxrwxrwx 1 root root 59 Feb 25 20:21 java -> /usr/lib/jvm/java-1.7.0-openjdk-[version].x86_64/jre/bin/java
You can do that by running the following:
ln -sfn /usr/lib/jvm/java-1.7.0-openjdk-[version].x86_64/jre/bin/java /etc/alternatives/java
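Since the exact directory name depends on which OpenJDK build yum installed, it can help to look it up rather than type it. A small sketch (the function name list_java7_binaries is made up; it assumes the JDKs live under /usr/lib/jvm, as they do in the listing above):

```shell
#!/bin/sh
# List the java binaries of any installed OpenJDK 7 builds.
# The optional argument overrides the directory to search.
list_java7_binaries() {
    jvm_dir="${1:-/usr/lib/jvm}"
    for candidate in "$jvm_dir"/java-1.7.0-openjdk*/jre/bin/java; do
        # An unmatched glob stays literal, so only print paths that exist.
        if [ -e "$candidate" ]; then
            echo "$candidate"
        fi
    done
}
```

As root, ln -sfn "$(list_java7_binaries | head -n 1)" /etc/alternatives/java repoints the symlink, and java -version confirms the result.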
PostgreSQL
yum install postgresql-server
There are some additional tweaks to make in a bit, but we can wait a few minutes to get the rest of the stuff installed.
Tomcat 7
yum install tomcat7 tomcat7-admin-webapps
Don’t start Tomcat yet. There is still work to do.
Maven
Download Maven from the Apache Maven site. Get the latest version, probably. I am using 3.0.5 with no problems. Use the binary tar.gz.
Untar and gunzip it, then copy the resulting folder to someplace like /usr/local/apache-maven/
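In shell terms, that step might look like the following sketch. It assumes the download is the 3.0.5 binary tarball, named apache-maven-3.0.5-bin.tar.gz, with a single top-level folder inside; the function name install_maven is hypothetical. Note that it unpacks the contents of that folder directly into the destination rather than nesting the folder inside it.

```shell
#!/bin/sh
# Unpack a Maven binary tarball so that bin/mvn sits directly under the
# destination directory (default: /usr/local/apache-maven).
install_maven() {
    tarball="$1"
    dest="${2:-/usr/local/apache-maven}"
    mkdir -p "$dest" || return 1
    # --strip-components=1 drops the top-level apache-maven-x.y.z/ folder.
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}
```

Run as root: install_maven apache-maven-3.0.5-bin.tar.gz. Afterwards the mvn script should be at /usr/local/apache-maven/bin/mvn.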
In a moment there are some environment variables to set, but we don’t want to do this for root.
Ant
yum install ant
That’s all of the prerequisites. All of these are covered in the DSpace documentation. What’s not documented very well is what I cover in Step 6.
Step 6: Tweak the Environment
We have all of the prerequisites in place, but the environment still needs a few things for DSpace to build.
We need a DSpace Linux user to own the DSpace installation. Create one.
useradd -m dspace
Now let’s initialize the database.
service postgresql initdb
This creates the files for the database. By default, the database is created in /var/lib/pgsql9. I don’t know what the size considerations for large DSpace installations look like, but from what I have experienced to date, you will be safe if you keep the database on the root disk. If you think you want it somewhere else, the additional EBS store is not a bad place, but make sure you have a dedicated place for it.
I encountered a baffling setback when I tried installing DSpace on a PostgreSQL 9 database, in that by default it wants to use ident authentication. Looking on a different system, I noticed that PostgreSQL was configured to trust both local and host (localhost) connections. So I set that up by editing the pg_hba.conf file generated in the previous step.
Edit it to match the following. All other lines should be commented out.
local all all trust
host all all 127.0.0.1/32 trust
This is not the most secure setup: trust means any account on the machine can connect as any database user without a password. But it does work, and previous installs of DSpace relied on it being this way.
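If you prefer not to edit the file by hand, the same change can be scripted. This is a sketch under assumptions: the function name set_trust_auth is made up, and the file location in the usage note (a data directory under /var/lib/pgsql9) is my guess from the default mentioned above, so check where your pg_hba.conf actually lives. It keeps a .bak copy, comments out every active line, and appends the two trust lines.

```shell
#!/bin/sh
# Rewrite a pg_hba.conf so that only the two trust rules are active.
set_trust_auth() {
    hba_file="$1"
    cp "$hba_file" "$hba_file.bak" || return 1
    {
        # Prefix every line whose first non-blank character is not "#".
        sed -e 's/^[[:space:]]*[^#[:space:]]/#&/' "$hba_file.bak"
        echo "local all all trust"
        echo "host all all 127.0.0.1/32 trust"
    } > "$hba_file"
}
```

As root: set_trust_auth /var/lib/pgsql9/data/pg_hba.conf, then restart PostgreSQL (service postgresql restart) so the change takes effect.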
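Make sure PostgreSQL is running (service postgresql start), then briefly become the postgres user, so you can create our DSpace database and user. Enter each command and follow any prompts. Note the password you set in the createuser step. You will need it later. When you are finished, type exit to return to root.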
su - postgres
createuser -U postgres -d -A -P dspace
createdb -U dspace -E UNICODE dspace
Next, it’s time to mount the EBS created to hold the DSpace code and assetstore.
Add the following line to the /etc/fstab file. Use the device name you picked earlier, when you created the EBS and attached it to the EC2 instance. Here we have /dev/sdf but if this is the first EBS you’ve created for a non-root disk, it could be /dev/sda.
/dev/sdf /dspace ext4 rw,noatime 0 0
Next go to the /dev directory so you can make the filesystem. Make sure the device you named and mapped in /etc/fstab is present in this directory.
Now mount the device.
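The two steps above, making the filesystem and mounting it, might look like this. A sketch only: the function name prepare_dspace_volume is made up, ext4 and /dspace come from the line above, and /dev/sdf stands for whatever device you actually picked. Be careful, because mkfs wipes whatever is on the device.

```shell
#!/bin/sh
# Format an attached (empty!) EBS device and mount it for DSpace.
prepare_dspace_volume() {
    device="${1:-/dev/sdf}"
    mount_point="${2:-/dspace}"
    if [ ! -e "$device" ]; then
        echo "$device not found; check that the volume is attached" >&2
        return 1
    fi
    mkfs -t ext4 "$device" || return 1      # DESTROYS existing data
    mkdir -p "$mount_point" || return 1
    mount -o rw,noatime "$device" "$mount_point"
}
```

Run it as root with prepare_dspace_volume /dev/sdf /dspace, then check the result with df -h /dspace.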
Now it’s time to change some group memberships. This is not strictly necessary, but it might help mitigate potential permissions problems. I added the tomcat user to the dspace group and the dspace user to the tomcat group.
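A sketch of those group changes, assuming the Tomcat package created a user and group both named tomcat and that useradd gave dspace a group of its own (the Red Hat default); the function name share_groups is made up:

```shell
#!/bin/sh
# Put each service account into the other's group.
# -a appends, so existing supplementary groups are kept.
share_groups() {
    usermod -a -G dspace tomcat &&
    usermod -a -G tomcat dspace
}
```

Run share_groups as root, and check the result with id tomcat and id dspace.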
Next let’s change the /dspace directory to be owned by dspace.
chown -R dspace /dspace
Now become the dspace user so you can add the environment variables.
su - dspace
Add or change the following lines:
Here we’ve pointed M2_HOME to the location where we installed the Maven binary, set the M2 variable to its bin directory, and added the bin location to our path. With the environment variables in this file, these get set every time the dspace user logs in. So we need to exit back to root and log in again to make sure they are correctly set.
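As a sketch, assuming the file in question is the dspace user’s ~/.bash_profile and that Maven ended up directly in /usr/local/apache-maven, the lines could look like this:

```shell
# Lines for the dspace user's ~/.bash_profile (location is an assumption).
export M2_HOME=/usr/local/apache-maven    # where the Maven folder was copied
export M2="$M2_HOME/bin"                  # Maven's bin directory
export PATH="$M2:$PATH"                   # make mvn available on the path
```

After logging in again, mvn --version should print the Maven version if everything is in place.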
su - dspace
Step 7: Download, configure, and build DSpace
This entire step assumes you are the dspace user, which you will be after following the previous steps.
Download the DSpace code from dspace.org. Since it comes from SourceForge, you’ll have to click to download, then copy the direct link, pasting it in your SSH window.
Unzip the file, change into the resulting source directory, and open build.properties in your editor.
Follow the directions in build.properties to get things set up. The comments are pretty helpful, but you can also refer to the DSpace documentation if you need assistance. What these get set to may depend on your local needs.
Pay particular attention to the db.password, which should match what you set for the dspace user above.
Many of the defaults are fine to use.
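Before building, it doesn’t hurt to double-check the handful of settings that matter most. A sketch, with the caveat that only db.password is named above; the other property names are what I’d expect to find in a DSpace 3.x build.properties and are assumptions, as is the function name show_key_settings:

```shell
#!/bin/sh
# Print the install directory and database settings from build.properties.
# Property names other than db.password are assumptions; adjust to your file.
show_key_settings() {
    props="${1:-build.properties}"
    grep -E '^[[:space:]]*(dspace\.install\.dir|db\.url|db\.username|db\.password)[[:space:]]*=' "$props"
}
```

Run show_key_settings from the source directory and compare db.password against the password you gave createuser.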
After saving and exiting the build.properties file, you can finally begin the compile process.
Wait for the mvn package command to finish. It can take a while. Once it’s completed successfully, you will see a big SUCCESS message. Now move into the target directory and do the actual install.
This creates the database schema, adds default data, copies the DSpace code, and such. Watch to make sure this is successful as well. It takes some time.
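Put together, the build and install might look like the sketch below. It assumes mvn package is run from the top of the unzipped source tree, that the build directory under dspace/target is named dspace-[version]-build (so the code globs for it), and that the Ant target is fresh_install, which is what the DSpace install documentation uses for a first install. The function name and the example path in the usage note are made up.

```shell
#!/bin/sh
# Build DSpace with Maven, then install it with Ant.
# The body runs in a subshell, so the cd calls do not affect the caller.
build_and_install_dspace() (
    cd "$1" || exit 1
    mvn package || exit 1
    # The build directory name embeds the DSpace version, so glob for it.
    for build_dir in dspace/target/dspace-*-build*; do
        if [ -d "$build_dir" ]; then
            cd "$build_dir" && ant fresh_install
            exit    # pass along the status of the install
        fi
    done
    echo "no build directory found under dspace/target" >&2
    exit 1
)
```

As the dspace user, call it with the path to the unzipped source, for example build_and_install_dspace ~/dspace-src (a placeholder path).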
Step 8: Copy the desired webapps and start Tomcat
Finally you can copy your webapps to where Tomcat can use them. If you know which particular webapps you want to copy, you can copy them one by one, or you can just copy all of them.
One by one, for example (run from within /dspace/webapps):
cp -R xmlui/ /usr/share/tomcat7/webapps/
Or all of them at once:
cp -R /dspace/webapps/* /usr/share/tomcat7/webapps/
Now start Tomcat. It takes a while to start, sometimes as long as 20 minutes if you aren’t very lucky. Mostly it takes around 2-5 minutes.
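Rather than reloading the browser, you can let the shell do the waiting. A sketch, assuming curl is installed and that you copied the xmlui webapp; the function name wait_for_url is made up, and the defaults (120 tries, 10 seconds apart) simply cover the 20-minute worst case mentioned above:

```shell
#!/bin/sh
# Poll a URL until it answers successfully or the attempts run out.
wait_for_url() {
    url="$1"
    tries="${2:-120}"    # maximum number of attempts
    delay="${3:-10}"     # seconds to wait between attempts
    attempt=1
    while [ "$attempt" -le "$tries" ]; do
        # -f makes curl fail on HTTP errors such as 404 or 503.
        if curl -s -f -o /dev/null "$url"; then
            echo "up after $attempt attempt(s)"
            return 0
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
    echo "gave up after $tries attempts" >&2
    return 1
}
```

As root, start Tomcat with service tomcat7 start, then run wait_for_url http://localhost:8080/xmlui/ and go get a coffee.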
Wrapping it up
That’s it! You should now have a functioning DSpace. Other things you’ll want to do include creating an administrator account and testing the web application. Try following along in the DSpace manual or reading up on the various command line tools available to find out what else you can do, including advanced configurations.