Commit 651236ff authored by Rayan Chikhi

initial commit
# Contributing Guidelines
Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.
Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.
## Reporting Bugs/Feature Requests
We welcome you to use the GitHub issue tracker to report bugs or suggest features.
When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment
## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
1. You are working against the latest source on the *master* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
To send us a pull request, please:
1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute to. As our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), any 'help wanted' issues are a great place to start.
## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
opensource-codeofconduct@amazon.com with any additional questions or comments.
## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
## Licensing
See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
# Download SRA reads on AWS Batch, convert them to FASTQ, and upload them to another S3 bucket
### Source
Similar to https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly (and references therein), and aims at comparable functionality.
### Installation
Execute the commands below to spin up the CloudFormation infrastructure stack.
```
./spinup.sh
./deploy-docker.sh
```
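Note that `spinup.sh` returns before CloudFormation has finished creating resources. As a minimal sketch (assuming boto3 is installed and the stack name `serratus-batch-dl` used by `spinup.sh` and `cleanup.sh`), readiness can be checked from the DescribeStacks response; only the response-reshaping helper is executed here, the boto3 call is shown as a comment:

```python
def stack_status(describe_stacks_response):
    """Return the status of the first stack in a DescribeStacks response,
    e.g. CREATE_IN_PROGRESS or CREATE_COMPLETE."""
    return describe_stacks_response["Stacks"][0]["StackStatus"]

# Typical use (requires boto3 and AWS credentials; not run here):
#   import boto3
#   cf = boto3.client("cloudformation", region_name="us-east-1")
#   print(stack_status(cf.describe_stacks(StackName="serratus-batch-dl")))

# The response shape can be exercised offline:
example = {"Stacks": [{"StackStatus": "CREATE_COMPLETE"}]}
print(stack_status(example))  # → CREATE_COMPLETE
```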
If you ever recreate the stack (e.g. after `cleanup.sh`), you don't need to run `deploy-docker.sh` unless the Dockerfile or scripts in `src/` were modified.
### Running a download job
1. `./submit_job.py SRRxxxxxx`
2. In AWS Console > Batch, monitor how the job runs.
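Besides watching the console, the job can be polled programmatically. A hedged sketch: the helper below only reshapes a Batch DescribeJobs response (the job id would come from the output of `submit_job.py`; the boto3 call is shown as a comment):

```python
def job_statuses(describe_jobs_response):
    """Extract (jobId, status) pairs from a Batch DescribeJobs response.
    Status is one of SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING,
    SUCCEEDED or FAILED."""
    return [(j["jobId"], j["status"]) for j in describe_jobs_response["jobs"]]

# Typical use (requires boto3 and AWS credentials; not run here):
#   import boto3
#   batch = boto3.client("batch")
#   print(job_statuses(batch.describe_jobs(jobs=["<job id printed by submit_job.py>"])))

# Offline exercise of the response shape:
example = {"jobs": [{"jobId": "abc123", "status": "RUNNING"}]}
print(job_statuses(example))  # → [('abc123', 'RUNNING')]
```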
### Code Cleanup
In short:
```
./cleanup.sh
```
This deletes the CloudFormation stack.
What it doesn't do (needs to be done manually):
In the AWS Console > ECR > serratus-dl-batch-job, delete the image(s) that were pushed to the repository.
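The manual ECR step could also be scripted. A sketch under the same assumptions (boto3 available; repository name `serratus-dl-batch-job` as pushed by `deploy-docker.sh`); only the response reshaping is executed here:

```python
def image_ids(list_images_response):
    """Collect the imageIds entries in the form batch_delete_image expects."""
    return list_images_response.get("imageIds", [])

# Typical use (requires boto3 and AWS credentials; not run here):
#   import boto3
#   ecr = boto3.client("ecr", region_name="us-east-1")
#   ids = image_ids(ecr.list_images(repositoryName="serratus-dl-batch-job"))
#   if ids:
#       ecr.batch_delete_image(repositoryName="serratus-dl-batch-job", imageIds=ids)

# Offline exercise of the response shape:
example = {"imageIds": [{"imageDigest": "sha256:abcd", "imageTag": "latest"}]}
print(image_ids(example))
```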
## License
This library is licensed under the MIT-0 License. See the LICENSE file.
aws cloudformation delete-stack --stack-name serratus-batch-dl
#!/bin/bash
cd src
ACCOUNT=$(aws sts get-caller-identity --query Account --output text) # AWS ACCOUNT ID
DOCKER_CONTAINER=serratus-dl-batch-job
REPO=${ACCOUNT}.dkr.ecr.us-east-1.amazonaws.com/${DOCKER_CONTAINER}
TAG=build-$(date -u "+%Y-%m-%d")
echo "Building Docker Image..."
docker build -t "$DOCKER_CONTAINER" \
  --build-arg AWS_ACCESS_KEY_ID="$(./get-aws-profile.sh --key)" \
  --build-arg AWS_SECRET_ACCESS_KEY="$(./get-aws-profile.sh --secret)" \
  --build-arg AWS_DEFAULT_REGION=us-east-1 \
  .
echo "Authenticating against AWS ECR..."
eval "$(aws ecr get-login --no-include-email --region us-east-1)"
# create repository (only needed the first time; fails harmlessly if it already exists)
aws ecr create-repository --repository-name "$DOCKER_CONTAINER" 2>/dev/null || true
echo "Tagging ${REPO}..."
docker tag "$DOCKER_CONTAINER:latest" "$REPO:$TAG"
docker tag "$DOCKER_CONTAINER:latest" "$REPO:latest"
echo "Deploying to AWS ECR"
docker push "$REPO"
aws cloudformation create-stack --stack-name serratus-batch-dl --template-body file://template/template.yaml --capabilities CAPABILITY_NAMED_IAM
FROM python
# https://pythonspeed.com/articles/alpine-docker-python/
WORKDIR /
COPY batch_processor.py .
RUN pip install --upgrade pip && \
    pip install boto3 awscli
# local AWS credentials
ARG AWS_DEFAULT_REGION
#ENV AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
ARG AWS_ACCESS_KEY_ID
#ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
#ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
#SRA toolkit (from serratus-dl)
ENV SRATOOLKITVERSION='2.10.4'
RUN wget --quiet https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${SRATOOLKITVERSION}/sratoolkit.${SRATOOLKITVERSION}-centos_linux64.tar.gz && \
    tar xzf sratoolkit.${SRATOOLKITVERSION}-centos_linux64.tar.gz && \
    rm -f sratoolkit.${SRATOOLKITVERSION}-centos_linux64.tar.gz && \
    mkdir -p /opt/sratools && \
# Keep sratools grouped together, so it's easy to copy them all out into the runtime
    bash -c "mv sratoolkit.${SRATOOLKITVERSION}-centos_linux64/bin/{vdb-config*,prefetch*,fastq-dump*,fasterq-dump*,sra*} /opt/sratools" && \
# Install into /usr/local/bin for the rest of the build
    cp -r /opt/sratools/* /usr/local/bin && \
    mkdir /etc/ncbi
# https://github.com/ababaian/serratus/blob/5d288765b6e22bf7ba1b69148e0013d65560b968/containers/serratus-dl/Dockerfile#L51
RUN mkdir -p /root/.ncbi
RUN wget -O /root/.ncbi/user-settings.mkfg https://raw.githubusercontent.com/ababaian/serratus/master/containers/serratus-dl/VDB_user-settings.mkfg
RUN vdb-config --report-cloud-identity yes
# https://github.com/ababaian/serratus/blob/5d288765b6e22bf7ba1b69148e0013d65560b968/containers/serratus-dl/serratus-dl.sh#L167
RUN DLID="$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 8 | head -n 1 )-$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 4 | head -n 1 )-$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 4 | head -n 1 )-$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 4 | head -n 1 )-$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 12 | head -n 1 )" && sed -i "s/52e8a8fe-0cac-4bf2-983a-3617cdba7df5/$DLID/g" /root/.ncbi/user-settings.mkfg
# parallel-fastq-dump install
RUN wget --quiet https://raw.githubusercontent.com/rvalieris/parallel-fastq-dump/master/parallel-fastq-dump
RUN chmod +x parallel-fastq-dump
# fastp install
RUN wget --quiet http://opengene.org/fastp/fastp
RUN chmod +x fastp
# build-time sanity checks (debug output in the build log)
RUN pwd && df -h . && ls
import os
import sys
from datetime import datetime

import boto3
import urllib3

LOGTYPE_ERROR = 'ERROR'
LOGTYPE_INFO = 'INFO'
LOGTYPE_DEBUG = 'DEBUG'


def process_file(accession, region):
    urllib3.disable_warnings()
    s3 = boto3.client('s3')
    print("region - " + region)
    startTime = datetime.now()
    # go to /tmp (important, that's where local storage / nvme is)
    os.chdir("/tmp")
    os.system("pwd")
    # check free space
    os.system("df -h .")
    # download reads from accession
    os.system('mkdir -p out/')
    os.system('prefetch ' + accession)
    os.system('../parallel-fastq-dump --split-files --outdir out/ --threads 4 --sra-id ' + accession)
    files = os.listdir(os.getcwd() + "/out/")
    print("after fastq-dump, dir listing", files)
    inputDataFn = accession + ".inputdata.txt"
    with open(inputDataFn, "w") as g:
        for f in files:
            g.write(f + " " + str(os.stat("out/" + f).st_size) + "\n")
    # potential todo: there is opportunity to use mkfifo and speed-up the parallel-fastq-dump -> bbduk step
    # as per https://github.com/ababaian/serratus/blob/master/containers/serratus-dl/run_dl-sra.sh#L26
    # run fastp
    os.system("cat out/*.fastq | ../fastp --trim_poly_x --stdin -o " + accession + ".fastq")
    # upload filtered reads to s3
    outputBucket = "serratus-rayan"
    s3.upload_file(accession + ".fastq", outputBucket, "reads/" + accession + ".fastq")
    endTime = datetime.now()
    diffTime = endTime - startTime
    logMessage(accession, "Serratus-batch-dl processing time - " + str(diffTime.seconds), LOGTYPE_INFO)


def main():
    accession = ""
    region = "us-east-1"
    if "Accession" in os.environ:
        accession = os.environ.get("Accession")
    if "Region" in os.environ:
        region = os.environ.get("Region")
    if len(accession) == 0:
        sys.exit("This script needs an environment variable Accession set to something")
    logMessage(accession, 'parameters: ' + accession + " " + region, LOGTYPE_INFO)
    process_file(accession, region)


def logMessage(fileName, message, logType):
    try:
        logMessageDetails = constructMessageFormat(fileName, message, "", logType)
        if logType == "INFO" or logType == "ERROR":
            print(logMessageDetails)
        elif logType == "DEBUG":
            if os.environ.get('DEBUG') == "LOGTYPE":
                print(logMessageDetails)
    except Exception as ex:
        logMessageDetails = constructMessageFormat(fileName, message, "Error occurred at Batch_processor.logMessage" + str(ex), logType)
        print(logMessageDetails)


def constructMessageFormat(fileName, message, additionalErrorDetails, logType):
    if additionalErrorDetails != "":
        return "fileName: " + fileName + " " + logType + ": " + message + " Additional Details - " + additionalErrorDetails
    else:
        return "fileName: " + fileName + " " + logType + ": " + message


if __name__ == '__main__':
    main()
#!/bin/bash -f
#
# Fetch the AWS access key and/or secret for an AWS profile
# stored in the ~/.aws/credentials file ini format
#
# Aaron Roydhouse <aaron@roydhouse.com>, 2017
# https://github.com/whereisaaron/get-aws-profile-bash/
#
#
# cfg_parser - Parse ini files into variables
# By Andres J. Diaz
# http://theoldschooldevops.com/2008/02/09/bash-ini-parser/
# Use the pastebin links, as WordPress corrupts the code
# http://pastebin.com/f61ef4979 (original)
# http://pastebin.com/m4fe6bdaf (supports spaces in values)
#
cfg_parser ()
{
  IFS=$'\n' && ini=( $(<$1) )              # convert to line-array
  ini=( ${ini[*]//;*/} )                   # remove comments with ;
  ini=( ${ini[*]//\#*/} )                  # remove comments with #
  ini=( ${ini[*]/\ =/=} )                  # remove tabs before =
  ini=( ${ini[*]/=\ /=} )                  # remove tabs after =
  ini=( ${ini[*]/\ *=\ /=} )               # remove anything with a space around =
  ini=( ${ini[*]/#[/\}$'\n'cfg.section.} ) # set section prefix
  ini=( ${ini[*]/%]/ \(} )                 # convert text2function (1)
  ini=( ${ini[*]/=/=\( } )                 # convert item to array
  ini=( ${ini[*]/%/ \)} )                  # close array parenthesis
  ini=( ${ini[*]/%\\ \)/ \\} )             # the multiline trick
  ini=( ${ini[*]/%\( \)/\(\) \{} )         # convert text2function (2)
  ini=( ${ini[*]/%\} \)/\}} )              # remove extra parenthesis
  ini[0]=""                                # remove first element
  ini[${#ini[*]} + 1]='}'                  # add the last brace
  eval "$(echo "${ini[*]}")"               # eval the result
}
# echo a message to standard error (used for messages not intended
# to be parsed by scripts, such as usage messages, warnings or errors)
echo_stderr() {
  echo "$@" >&2
}
#
# Parse options
#
display_usage ()
{
  echo_stderr "Usage: $0 [--credentials=<path>] [--profile=<name>] [--key|--secret|--session-token]"
  echo_stderr " Default --credentials is '~/.aws/credentials'"
  echo_stderr " Default --profile is 'default'"
  echo_stderr " By default, export statements for environment variables are generated, e.g."
  echo_stderr " source \$($0 --profile=myprofile)"
  echo_stderr " You can specify one of --key, --secret, or --session-token to get just that value, with no line break:"
  echo_stderr " FOO_KEY=\$($0 --profile=myprofile --key)"
  echo_stderr " FOO_SECRET=\$($0 --profile=myprofile --secret)"
  echo_stderr " FOO_SESSION_TOKEN=\$($0 --profile=myprofile --session-token)"
}
for i in "$@"
do
  case $i in
    --credentials=*)
      CREDENTIALS="${i#*=}"
      shift # past argument=value
      ;;
    --profile=*)
      PROFILE="${i#*=}"
      shift # past argument=value
      ;;
    --key)
      SHOW_KEY=true
      shift # past argument with no value
      ;;
    --secret)
      SHOW_SECRET=true
      shift # past argument with no value
      ;;
    --session-token)
      SHOW_SESSION_TOKEN=true
      shift # past argument with no value
      ;;
    --help)
      display_usage
      exit 0
      ;;
    *)
      # unknown option
      echo "Unknown option $i"
      display_usage
      exit 1
      ;;
  esac
done
#
# Check options
#
CREDENTIALS=${CREDENTIALS:-~/.aws/credentials}
PROFILE=${PROFILE:-default}
SHOW_KEY=${SHOW_KEY:-false}
SHOW_SECRET=${SHOW_SECRET:-false}
SHOW_SESSION_TOKEN=${SHOW_SESSION_TOKEN:-false}
if [[ "${SHOW_KEY}" = true && "${SHOW_SECRET}" = true ]]; then
  echo_stderr "Can only specify one of --key or --secret"
  display_usage
  exit 2
fi
#
# Parse and display
#
if [[ ! -r "${CREDENTIALS}" ]]; then
  echo_stderr "File not found: '${CREDENTIALS}'"
  exit 3
fi
cfg_parser "${CREDENTIALS}"
if [[ $? -ne 0 ]]; then
  echo_stderr "Parsing credentials file '${CREDENTIALS}' failed"
  exit 4
fi
cfg.section.${PROFILE}
if [[ $? -ne 0 ]]; then
  echo_stderr "Profile '${PROFILE}' not found"
  exit 5
fi
if [[ "${SHOW_KEY}" = false && "${SHOW_SECRET}" = false && "${SHOW_SESSION_TOKEN}" = false ]]; then
  echo "export AWS_ACCESS_KEY_ID=${aws_access_key_id}"
  echo "export AWS_SECRET_ACCESS_KEY=${aws_secret_access_key}"
  echo "export AWS_SESSION_TOKEN=${aws_session_token}"
elif [[ "${SHOW_KEY}" = true ]]; then
  echo -n "${aws_access_key_id}"
elif [[ "${SHOW_SECRET}" = true ]]; then
  echo -n "${aws_secret_access_key}"
elif [[ "${SHOW_SESSION_TOKEN}" = true ]]; then
  echo -n "${aws_session_token}"
else
  echo_stderr "Unknown error"
  exit 9
fi
exit 0
NAME=serratus-dl-batch-job
docker build -t "$NAME" \
  --build-arg AWS_ACCESS_KEY_ID="$(./get-aws-profile.sh --key)" \
  --build-arg AWS_SECRET_ACCESS_KEY="$(./get-aws-profile.sh --secret)" \
  --build-arg AWS_DEFAULT_REGION=us-east-1 \
  .
docker run \
  -e AWS_ACCESS_KEY_ID="$(./get-aws-profile.sh --key)" \
  -e AWS_SECRET_ACCESS_KEY="$(./get-aws-profile.sh --secret)" \
  -e AWS_DEFAULT_REGION=us-east-1 \
  -e Accession=SRR10975663 \
  "$NAME" \
  python batch_processor.py
export Accession=SRR10975663
# bit bigger
#export Accession=SRR10041282
export Region=us-east-1
python batch_processor.py
import sys

import boto3

if len(sys.argv) < 2:
    sys.exit("argument: [accession]")
accession = sys.argv[1]
if "RR" not in accession:
    sys.exit("accession should be of the form: [E/S]RR[0-9]+")

batch = boto3.client('batch')
region = batch.meta.region_name
response = batch.submit_job(jobName='RayanSerratusDlBatchProcessingJobQueue',
                            jobQueue='RayanSerratusDlBatchProcessingJobQueue',
                            jobDefinition='RayanSerratusDlBatchJobDefinition',
                            containerOverrides={
                                "command": ["python", "batch_processor.py"],
                                "environment": [
                                    {"name": "Accession", "value": accession},
                                    {"name": "Region", "value": region},
                                ]
                            })
print("Job ID is {}.".format(response['jobId']))
---
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Orchestrating an Application Process with AWS Batch using CloudFormation'
Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
  InternetGateway:
    Type: AWS::EC2::InternetGateway
  RouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId:
        Ref: VPC
  VPCGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId:
        Ref: VPC
      InternetGatewayId:
        Ref: InternetGateway
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: EC2 Security Group for instances launched in the VPC by Batch
      VpcId:
        Ref: VPC
  Subnet:
    Type: AWS::EC2::Subnet
    Properties:
      CidrBlock: 10.0.0.0/24
      VpcId:
        Ref: VPC
      MapPublicIpOnLaunch: 'True'
  Route:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId:
        Ref: RouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId:
        Ref: InternetGateway
  SubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId:
        Ref: RouteTable
      SubnetId:
        Ref: Subnet
  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: batch.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
  IamInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - Ref: EcsInstanceRole
  EcsInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2008-10-17'
        Statement:
          - Sid: ''
            Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
  SpotIamFleetRole: # taken from https://github.com/aodn/aws-wps/blob/master/wps-cloudformation-template.yaml
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: spot.amazonaws.com
            Action: sts:AssumeRole
          - Effect: Allow
            Principal:
              Service: spotfleet.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole
  RayanSerratusDlBatchProcessingJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: RayanSerratusDlBatchJobDefinition
      ContainerProperties:
        Image:
          Fn::Join:
            - ''
            - - Ref: AWS::AccountId
              - .dkr.ecr.
              - Ref: AWS::Region
              # image name matches the repository pushed by deploy-docker.sh
              - ".amazonaws.com/serratus-dl-batch-job:latest"
        Vcpus: 4
        Memory: 7000
        MountPoints:
          - ContainerPath: /tmp
            SourceVolume: temp_dir
        Volumes:
          - Host:
              SourcePath: /tmp
            Name: temp_dir
      RetryStrategy:
        Attempts: 1
  RayanSerratusDlBatchProcessingJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: RayanSerratusDlBatchProcessingJobQueue
      Priority: 1
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: