Table 2 Timings in seconds for the different pipeline stages when running Crossbow on an HPC node (16 CPU cores), the Hadoop I cluster (eight nodes, 56 CPU cores), and the Hadoop II cluster (eight nodes, 112 CPU cores) for datasets S1-S9

From: A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data

| Stages | Platform | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 |
|---|---|---|---|---|---|---|---|---|---|---|
| ingest to HDFS | Hadoop I,II | 106 | 236 | 472 | 606 | 862 | 974 | 1018 | 1244 | 1384 |
| conversion | gz split (HPC) | 466 | 782 | 1094 | 1406 | 1728 | 2052 | 2390 | 2774 | 3090 |
| | gz to bz2 conversion (Hadoop I,II) | 211 | 423 | 633 | 842 | 1056 | 1264 | 1473 | 1685 | 1911 |
| preprocess | HPC | 406 | 630 | 1002 | 1235 | 1469 | 1810 | 2043 | 2283 | 2660 |
| | Hadoop I | 560 | 891 | 1172 | 1672 | 1937 | 2271 | 2665 | 3011 | 3396 |
| | Hadoop II | 537 | 685 | 892 | 1179 | 1414 | 1641 | 2091 | 2334 | 2613 |
| map | HPC | 1434 | 2857 | 4281 | 5775 | 7216 | 8627 | 10088 | 11432 | 13028 |
| | Hadoop I | 707 | 1385 | 2060 | 3331 | 3398 | 4163 | 4761 | 5630 | 6276 |
| | Hadoop II | 511 | 981 | 1459 | 2636 | 3023 | 3194 | 3361 | 4553 | 4766 |
| | Hadoop II* | 486 | 955 | 1422 | 1882 | 2336 | 2812 | 3310 | 3771 | 4305 |
| SNP call | HPC | 1045 | 1698 | 2621 | 3553 | 10989 | 18993 | 16890 | 20785 | 21948 |
| | Hadoop I | 666 | 994 | 1127 | 1423 | 1906 | 2287 | 2765 | 2982 | 3444 |
| | Hadoop II | 661 | 965 | 1344 | 1364 | 1830 | 2450 | 2765 | 3029 | 3471 |
| total time | HPC | 3351 | 5967 | 8998 | 11968 | 21402 | 31554 | 31411 | 37274 | 40726 |
| | Hadoop I | 2250 | 3929 | 5464 | 7848 | 9159 | 10959 | 12682 | 14558 | 16436 |
| | Hadoop II | 2026 | 3289 | 4601 | 6607 | 7719 | 8903 | 10393 | 12845 | 14145 |

  1. The ‘Hadoop II*’ data were obtained as follows: the average time for each mapping job was multiplied by the number of successful Hadoop mapping jobs, omitting the failed jobs. The errors are not shown.
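
The calculation described in the footnote can be sketched as follows. This is only an illustration under stated assumptions, not the authors' script: the function name `estimate_map_time`, its arguments, and the example durations are hypothetical, and the sketch assumes that the average is taken over the successful jobs only (the footnote does not say which jobs enter the average).

```python
from statistics import mean


def estimate_map_time(job_durations, job_succeeded):
    """Estimate a failure-free map time in seconds.

    Assumption (not confirmed by the source): the average job time is
    computed over successful jobs only, then multiplied by the number
    of successful jobs; failed jobs are omitted entirely.
    """
    # Keep only the durations of jobs that completed successfully.
    successful = [d for d, ok in zip(job_durations, job_succeeded) if ok]
    if not successful:
        raise ValueError("no successful mapping jobs to average over")
    # Average time per mapping job times the number of successful jobs.
    return mean(successful) * len(successful)


# Hypothetical per-job durations in seconds; the third job failed.
durations = [100.0, 120.0, 90.0, 110.0]
succeeded = [True, True, False, True]
estimate = estimate_map_time(durations, succeeded)
```

Under this assumption the estimate equals the summed duration of the successful jobs. If the average were instead taken over all jobs, including failed ones, the result would differ.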