CLIP/iCLIP_trim_and_dedup.sh · 5ff10a8620a697d3c36d17a8ff4970d59b07f6db · Blaise LI / bioinfo_utils

Pipeline to process iCLIP data. · 53909846

Blaise Li authored 7 years ago

Currently goes from demultiplexing to mapping, via trimming and
deduplicating. The mapping is performed on three read types:
- adapt_nodedup (the adaptor was found, and the reads were trimmed but
  not deduplicated)
- adapt_deduped (the adaptor was found, and the reads were trimmed and
  deduplicated)
- noadapt_deduped (the adaptor was not found, and the reads were trimmed
  and deduplicated)

The trim_and_dedup script currently assumes that two low-diversity zones
are present, and ignores them for deduplication:

NNNNNGCACTANNNWWW[YYYY]NNNN
1---5 : 5' UMI
     6--11: barcode (lower diversity)
          12-14: UMI
            15-17: AT(or GC?)-rich (low diversity)
                [fragment]
                       -4 -> -1: 3' UMI
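
As a rough illustration of the layout above, a deduplication key that skips
the two low-diversity zones could be built as follows. This is only a sketch
under the stated layout, not the code of trim_and_dedup: the function name
dedup_key is hypothetical, and the read is assumed to be already trimmed so
that it ends with the 3' UMI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): build a deduplication key
# from one trimmed read, following the layout NNNNNGCACTANNNWWW[YYYY]NNNN.
dedup_key() {
    local seq="$1"
    # Keep positions 1-5 (5' UMI), 12-14 (UMI) and 18-end (fragment + 3' UMI).
    # Skip positions 6-11 (barcode) and 15-17 (low-diversity zone).
    printf '%s%s%s\n' "${seq:0:5}" "${seq:11:3}" "${seq:17}"
}
```

With this sketch, two reads that differ only in positions 6-11 or 15-17 get
the same key and would be treated as duplicates.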

Deduplicating on the end of the reads may be a problem, because read ends
tend to be of lower quality. A duplicate carrying a sequencing error there
looks like a distinct read and is not collapsed, so reads with errors will
be over-represented. That is why we decided to also look at the
non-deduplicated reads.
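
The effect can be shown on toy data. The sketch below is purely illustrative
and is not what the pipeline does: both function names are hypothetical, and
dropping the last 4 bases is used here only to contrast the two counts.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: count distinct sequences read from stdin,
# one sequence per line.
count_distinct_full() {
    # The whole read, including the 3' UMI, takes part in the comparison.
    sort -u | wc -l | tr -d ' '
}

count_distinct_without_3p_umi() {
    # Drop the last 4 bases (the 3' UMI) before comparing.
    sed 's/....$//' | sort -u | wc -l | tr -d ' '
}
```

For two copies of the same molecule where one has a single error in the last
base, `printf 'ACGTACGTAAAA\nACGTACGTAAAT\n' | count_distinct_full` reports 2
distinct reads, while `count_distinct_without_3p_umi` reports 1.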
