Sunday, August 25, 2013

Handling huge text files in Linux.

INTRODUCTION

Working with huge text files (in my recent experience, a SQL dump of ~37GB) brings complications: disk space, slowness, memory, and so on.
Below are 3 techniques I tried in order to work around the problem.




SETUP

Invoking the script that generates the "huge file":
$ ./build_huge_02.sh 1000000 > myhugefile.txt


Initial state of the directory and files:

$ pwd
/home/fulano/Desktop/HugeFiles/poc
$ ls -l
total 3661612
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      1034 Aug 25 12:22 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       314 Aug 25 12:23 split_huge_01.sh
$ wc -l myhugefile*
  15000000 myhugefile_00.txt
  15000000 myhugefile_grep.txt
  15000000 myhugefile_ruby.txt
  15000000 myhugefile_split.txt
  15000000 myhugefile.txt
  75000000 total
$


The goal is to change the following 4 strings:
= ITERATION 250000 =
= ITERATION 500000 =
= ITERATION 750000 =
= ITERATION 1000000 =
to, respectively:
= ITERACION #250000 =
= ITERACION #500000 =
= ITERACION #750000 =
= ITERACION #1000000 =


Check the *initial* state of the strings (regexes) in the files:

$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile*
myhugefile_00.txt:3749986:========== ITERATION 250000 ==========
myhugefile_00.txt:7499986:========== ITERATION 500000 ==========
myhugefile_00.txt:11249986:========== ITERATION 750000 ==========
myhugefile_00.txt:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
myhugefile_ruby.txt:3749986:========== ITERATION 250000 ==========
myhugefile_ruby.txt:7499986:========== ITERATION 500000 ==========
myhugefile_ruby.txt:11249986:========== ITERATION 750000 ==========
myhugefile_ruby.txt:14999986:========== ITERATION 1000000 ==========
myhugefile_split.txt:3749986:========== ITERATION 250000 ==========
myhugefile_split.txt:7499986:========== ITERATION 500000 ==========
myhugefile_split.txt:11249986:========== ITERATION 750000 ==========
myhugefile_split.txt:14999986:========== ITERATION 1000000 ==========
myhugefile.txt:3749986:========== ITERATION 250000 ==========
myhugefile.txt:7499986:========== ITERATION 500000 ==========
myhugefile.txt:11249986:========== ITERATION 750000 ==========
myhugefile.txt:14999986:========== ITERATION 1000000 ==========
$
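The same long grep pattern is typed again after every technique. A minimal sketch of a helper that keeps it in one place follows. The names MARKERS and check_markers are hypothetical (they are not part of the original session), and the pattern is a more compact regex that should match the same eight strings as the one above.

```shell
#!/bin/sh
# Old and new marker strings in a single extended regex (hypothetical variable name).
MARKERS='= (ITERATION |ITERACION #)(250000|500000|750000|1000000) ='

# check_markers FILE...: print file name, line number and text of every
# line that contains one of the old or new markers.
check_markers() {
    grep -H -n -E "$MARKERS" "$@"
}
```

It would be called the same way as the grep above, for example check_markers myhugefile*.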




TECHNIQUE 1: Ruby script.


$ pwd
/home/fulano/Desktop/HugeFiles/poc
$ ls -l
total 3661616
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       314 Aug 25 12:23 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      3077 Aug 25 13:28 STEPS.txt
$ ruby --version
ruby 2.0.0p0 (2013-02-24 revision 39474) [i686-linux]
$ time ruby ./huge_parser_00.rb < myhugefile_ruby.txt > myhugefile_ruby.new00
= ITERATION 250000 =: = ITERACION #250000 =
= ITERATION 500000 =: = ITERACION #500000 =
= ITERATION 750000 =: = ITERACION #750000 =
= ITERATION 1000000 =: = ITERACION #1000000 =
Matched: = ITERATION 250000 = -> = ITERACION #250000 =
Matched: = ITERATION 500000 = -> = ITERACION #500000 =
Matched: = ITERATION 750000 = -> = ITERACION #750000 =
Matched: = ITERATION 1000000 = -> = ITERACION #1000000 =
real 18m33.633s
user 17m41.852s
sys 0m9.196s
$ ls -l
total 4393936
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       314 Aug 25 12:23 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      3077 Aug 25 13:28 STEPS.txt
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_ruby.*
myhugefile_ruby.new00:3749986:========== ITERACION #250000 ==========
myhugefile_ruby.new00:7499986:========== ITERACION #500000 ==========
myhugefile_ruby.new00:11249986:========== ITERACION #750000 ==========
myhugefile_ruby.new00:14999986:========== ITERACION #1000000 ==========
myhugefile_ruby.txt:3749986:========== ITERATION 250000 ==========
myhugefile_ruby.txt:7499986:========== ITERATION 500000 ==========
myhugefile_ruby.txt:11249986:========== ITERATION 750000 ==========
myhugefile_ruby.txt:14999986:========== ITERATION 1000000 ==========
$
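The grep above confirms that the four markers changed, but not that every other line stayed the same. A possible stricter check is sketched here: it walks both files in parallel with awk and counts the lines that differ, without loading either file into memory. The function name count_changed_lines is hypothetical, and the sketch assumes both files have the same number of lines (which wc -l can confirm).

```shell
#!/bin/sh
# count_changed_lines ORIGINAL NEW: print how many lines differ between
# the two files, comparing them line by line in a single streaming pass.
count_changed_lines() {
    awk -v other="$2" '
        {
            # Read the matching line from the second file; treat EOF as empty.
            if ((getline o < other) <= 0) o = ""
            if ($0 != o) n++
        }
        END { print n + 0 }
    ' "$1"
}
```

For this experiment the expected answer would be 4, for example from count_changed_lines myhugefile_ruby.txt myhugefile_ruby.new00.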




TECHNIQUE 2: split into several files.


$ pwd
/home/fulano/Desktop/HugeFiles/poc
$ ls -l
total 4393952
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:26 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      8970 Aug 25 14:05 STEPS.txt
$ wc -l myhugefile_split.txt
15000000 myhugefile_split.txt
$ time split -l 1000000 myhugefile_split.txt myhugefile_split.part.
real 0m20.374s
user 0m0.416s
sys 0m1.636s
$ ls -l
total 5126320
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:27 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      8970 Aug 25 14:05 STEPS.txt
$ time ./split_huge_01.sh
myhugefile_split.part.ad:749986:========== ITERATION 250000 ==========
myhugefile_split.part.ah:499986:========== ITERATION 500000 ==========
myhugefile_split.part.al:249986:========== ITERATION 750000 ==========
myhugefile_split.part.ao:999986:========== ITERATION 1000000 ==========
real 0m1.487s
user 0m1.052s
sys 0m0.340s
$ nano myhugefile_split.part.ad
$ nano myhugefile_split.part.ah
$ nano myhugefile_split.part.al
$ nano myhugefile_split.part.ao
$ ls -l
total 5126320
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000028 Aug 25 14:30 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999893 Aug 25 14:30 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:31 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000083 Aug 25 14:32 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      8970 Aug 25 14:05 STEPS.txt
$ time ./split_huge_01.sh
myhugefile_split.part.ad:749986:========== ITERACION #250000 ==========
myhugefile_split.part.ah:499986:========== ITERACION #500000 ==========
myhugefile_split.part.al:249986:========== ITERACION #750000 ==========
myhugefile_split.part.ao:999986:========== ITERACION #1000000 ==========
real 0m1.406s
user 0m1.056s
sys 0m0.316s
$ time cat myhugefile_split.part.* > myhugefile_split.new00
real 0m22.121s
user 0m0.000s
sys 0m1.572s
$ ls -l
total 5858640
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 14:33 myhugefile_split.new00
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000028 Aug 25 14:30 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999893 Aug 25 14:30 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:31 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000083 Aug 25 14:32 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano      8970 Aug 25 14:05 STEPS.txt
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_split.txt
myhugefile_split.txt:3749986:========== ITERATION 250000 ==========
myhugefile_split.txt:7499986:========== ITERATION 500000 ==========
myhugefile_split.txt:11249986:========== ITERATION 750000 ==========
myhugefile_split.txt:14999986:========== ITERATION 1000000 ==========
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_split.new00
myhugefile_split.new00:3749986:========== ITERACION #250000 ==========
myhugefile_split.new00:7499986:========== ITERACION #500000 ==========
myhugefile_split.new00:11249986:========== ITERACION #750000 ==========
myhugefile_split.new00:14999986:========== ITERACION #1000000 ==========
$ wc -l myhugefile_split.txt
15000000 myhugefile_split.txt
$ wc -l myhugefile_split.new00
15000000 myhugefile_split.new00
$
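The manual nano step is the part of this technique that does not scale. A hedged sketch of how it could be scripted: only chunks that contain a target marker are rewritten, using a sed substitution in place of the editor. This is a swapped-in step, not what was done in the session above, and the function name fix_chunks is hypothetical.

```shell
#!/bin/sh
# fix_chunks PREFIX: for every chunk PREFIX* that contains one of the four
# target markers, rewrite "ITERATION n" as "ITERACION #n" on that line.
fix_chunks() {
    for chunk in "$1"*; do
        # grep -q stops at the first match; chunks without a marker are only read.
        if grep -q -E '= ITERATION (250000|500000|750000|1000000) =' "$chunk"; then
            # Write to a temporary file and move it back, so no in-place flag is needed.
            sed -E 's/= ITERATION (250000|500000|750000|1000000) =/= ITERACION #\1 =/' \
                "$chunk" > "$chunk.tmp" && mv "$chunk.tmp" "$chunk"
        fi
    done
}
```

With the chunks from the split above, the call would be fix_chunks myhugefile_split.part. followed by the same cat to reassemble the file.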




TECHNIQUE 3: grep, head, sed, cat.


$ pwd
/home/fulano/Desktop/HugeFiles/poc
$ ls -l
total 5858644
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 14:33 myhugefile_split.new00
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000028 Aug 25 14:30 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999893 Aug 25 14:30 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:31 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000083 Aug 25 14:32 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano     16018 Aug 25 14:59 STEPS.txt
$ grep -H -n -E '= ITERATION 250000 =' myhugefile_grep.txt
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
$ grep -H -n -E '= ITERATION 249999 =' myhugefile_grep.txt
myhugefile_grep.txt:3749971:========== ITERATION 249999 ==========
$ sed -n -e '3749971,3749991p' -e '3749991q' myhugefile_grep.txt > myhugefile_grep.portion00
$ nano myhugefile_grep.portion00
$ time (head -n 3749970 myhugefile_grep.txt; cat myhugefile_grep.portion00; sed -e '1,3749991d' myhugefile_grep.txt) > myhugefile_grep.00
real 0m25.135s
user 0m5.548s
sys 0m2.756s
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_grep.*
myhugefile_grep.00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.00:7499986:========== ITERATION 500000 ==========
myhugefile_grep.00:11249986:========== ITERATION 750000 ==========
myhugefile_grep.00:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.portion00:16:========== ITERACION #250000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
$ grep -H -n -E '= ITERATION 499999 =' myhugefile_grep.txt
myhugefile_grep.txt:7499971:========== ITERATION 499999 ==========
$ sed -n -e '7499971,7499991p' -e '7499991q' myhugefile_grep.txt > myhugefile_grep.portion2
$ ls -l
total 6590972
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888897 Aug 25 15:08 myhugefile_grep.00
-rw-rw-r-- 1 fulano fulano       998 Aug 25 15:05 myhugefile_grep.portion00
-rw-rw-r-- 1 fulano fulano       997 Aug 25 15:12 myhugefile_grep.portion2
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 14:33 myhugefile_split.new00
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000028 Aug 25 14:30 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999893 Aug 25 14:30 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:31 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000083 Aug 25 14:32 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano     16018 Aug 25 14:59 STEPS.txt
$ nano myhugefile_grep.portion2
$ time (head -n 7499970 myhugefile_grep.txt; cat myhugefile_grep.portion2; sed -e '1,7499991d' myhugefile_grep.txt) > myhugefile_grep.2
real 0m25.613s
user 0m4.940s
sys 0m2.656s
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_grep.*
myhugefile_grep.00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.00:7499986:========== ITERATION 500000 ==========
myhugefile_grep.00:11249986:========== ITERATION 750000 ==========
myhugefile_grep.00:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.2:3749986:========== ITERATION 250000 ==========
myhugefile_grep.2:7499986:========== ITERACION #500000 ==========
myhugefile_grep.2:11249986:========== ITERATION 750000 ==========
myhugefile_grep.2:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.portion00:16:========== ITERACION #250000 ==========
myhugefile_grep.portion2:16:========== ITERACION #500000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
$ time (head -n 7499970 myhugefile_grep.00; cat myhugefile_grep.portion2; sed -e '1,7499991d' myhugefile_grep.00) > myhugefile_grep.2
real 0m27.239s
user 0m4.812s
sys 0m2.984s
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_grep.*
myhugefile_grep.00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.00:7499986:========== ITERATION 500000 ==========
myhugefile_grep.00:11249986:========== ITERATION 750000 ==========
myhugefile_grep.00:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.2:3749986:========== ITERACION #250000 ==========
myhugefile_grep.2:7499986:========== ITERACION #500000 ==========
myhugefile_grep.2:11249986:========== ITERATION 750000 ==========
myhugefile_grep.2:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.portion00:16:========== ITERACION #250000 ==========
myhugefile_grep.portion2:16:========== ITERACION #500000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
$ grep -H -n -E '= ITERATION 749999 =' myhugefile_grep.2
myhugefile_grep.2:11249971:========== ITERATION 749999 ==========
$ sed -n -e '11249971,11249991p' -e '11249991q' myhugefile_grep.2 > myhugefile_grep.portion3
$ nano myhugefile_grep.portion3
$ time (head -n 11249970 myhugefile_grep.2; cat myhugefile_grep.portion3; sed -e '1,11249991d' myhugefile_grep.2) > myhugefile_grep.3
real 0m25.166s
user 0m4.644s
sys 0m2.356s
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_grep.*
myhugefile_grep.00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.00:7499986:========== ITERATION 500000 ==========
myhugefile_grep.00:11249986:========== ITERATION 750000 ==========
myhugefile_grep.00:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.2:3749986:========== ITERACION #250000 ==========
myhugefile_grep.2:7499986:========== ITERACION #500000 ==========
myhugefile_grep.2:11249986:========== ITERATION 750000 ==========
myhugefile_grep.2:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.3:3749986:========== ITERACION #250000 ==========
myhugefile_grep.3:7499986:========== ITERACION #500000 ==========
myhugefile_grep.3:11249986:========== ITERACION #750000 ==========
myhugefile_grep.3:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.portion00:16:========== ITERACION #250000 ==========
myhugefile_grep.portion2:16:========== ITERACION #500000 ==========
myhugefile_grep.portion3:16:========== ITERACION #750000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
$ grep -H -n -E '= ITERATION 999999 =' myhugefile_grep.3
myhugefile_grep.3:14999971:========== ITERATION 999999 ==========
$ sed -n -e '14999971,14999991p' -e '14999991q' myhugefile_grep.3 > myhugefile_grep.portion4
$ nano myhugefile_grep.portion4
$ time (head -n 14999970 myhugefile_grep.3; cat myhugefile_grep.portion3; sed -e '1,14999991d' myhugefile_grep.3) > myhugefile_grep.new00
real 0m26.308s
user 0m4.136s
sys 0m2.436s
$ grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' myhugefile_grep.*
myhugefile_grep.00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.00:7499986:========== ITERATION 500000 ==========
myhugefile_grep.00:11249986:========== ITERATION 750000 ==========
myhugefile_grep.00:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.2:3749986:========== ITERACION #250000 ==========
myhugefile_grep.2:7499986:========== ITERACION #500000 ==========
myhugefile_grep.2:11249986:========== ITERATION 750000 ==========
myhugefile_grep.2:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.3:3749986:========== ITERACION #250000 ==========
myhugefile_grep.3:7499986:========== ITERACION #500000 ==========
myhugefile_grep.3:11249986:========== ITERACION #750000 ==========
myhugefile_grep.3:14999986:========== ITERATION 1000000 ==========
myhugefile_grep.new00:3749986:========== ITERACION #250000 ==========
myhugefile_grep.new00:7499986:========== ITERACION #500000 ==========
myhugefile_grep.new00:11249986:========== ITERACION #750000 ==========
myhugefile_grep.new00:14999986:========== ITERACION #750000 ==========
myhugefile_grep.portion00:16:========== ITERACION #250000 ==========
myhugefile_grep.portion2:16:========== ITERACION #500000 ==========
myhugefile_grep.portion3:16:========== ITERACION #750000 ==========
myhugefile_grep.portion4:16:========== ITERACION #1000000 ==========
myhugefile_grep.txt:3749986:========== ITERATION 250000 ==========
myhugefile_grep.txt:7499986:========== ITERATION 500000 ==========
myhugefile_grep.txt:11249986:========== ITERATION 750000 ==========
myhugefile_grep.txt:14999986:========== ITERATION 1000000 ==========
Note: the last head/cat/sed command concatenated myhugefile_grep.portion3 instead of myhugefile_grep.portion4, so line 14999986 of myhugefile_grep.new00 reads ITERACION #750000 rather than ITERACION #1000000. This is also why myhugefile_grep.new00 is 749888899 bytes in the listing that follows, one byte short of the 749888900 bytes of myhugefile_ruby.new00 and myhugefile_split.new00. The step should use portion4:
$ (head -n 14999970 myhugefile_grep.3; cat myhugefile_grep.portion4; sed -e '1,14999991d' myhugefile_grep.3) > myhugefile_grep.new00
$ ls -l
total 8787940
-rwxrwxr-x 1 fulano fulano       216 Aug 25 12:22 build_huge_02.sh
-rw-rw-r-- 1 fulano fulano      6470 Aug 25 14:25 errors.txt
-rw-rw-r-- 1 fulano fulano       848 Aug 25 13:29 huge_parser_00.rb
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:25 myhugefile_00.txt
-rw-rw-r-- 1 fulano fulano 749888897 Aug 25 15:08 myhugefile_grep.00
-rw-rw-r-- 1 fulano fulano 749888898 Aug 25 15:17 myhugefile_grep.2
-rw-rw-r-- 1 fulano fulano 749888899 Aug 25 15:22 myhugefile_grep.3
-rw-rw-r-- 1 fulano fulano 749888899 Aug 25 15:28 myhugefile_grep.new00
-rw-rw-r-- 1 fulano fulano       998 Aug 25 15:05 myhugefile_grep.portion00
-rw-rw-r-- 1 fulano fulano       998 Aug 25 15:12 myhugefile_grep.portion2
-rw-rw-r-- 1 fulano fulano       998 Aug 25 15:21 myhugefile_grep.portion3
-rw-rw-r-- 1 fulano fulano       999 Aug 25 15:26 myhugefile_grep.portion4
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:28 myhugefile_grep.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 13:49 myhugefile_ruby.new00
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:27 myhugefile_ruby.txt
-rw-rw-r-- 1 fulano fulano 749888900 Aug 25 14:33 myhugefile_split.new00
-rw-rw-r-- 1 fulano fulano  49922254 Aug 25 14:27 myhugefile_split.part.aa
-rw-rw-r-- 1 fulano fulano  49966560 Aug 25 14:27 myhugefile_split.part.ab
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ac
-rw-rw-r-- 1 fulano fulano  50000028 Aug 25 14:30 myhugefile_split.part.ad
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ae
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.af
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.ag
-rw-rw-r-- 1 fulano fulano  49999893 Aug 25 14:30 myhugefile_split.part.ah
-rw-rw-r-- 1 fulano fulano  50000081 Aug 25 14:27 myhugefile_split.part.ai
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.aj
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.ak
-rw-rw-r-- 1 fulano fulano  50000082 Aug 25 14:31 myhugefile_split.part.al
-rw-rw-r-- 1 fulano fulano  50000027 Aug 25 14:27 myhugefile_split.part.am
-rw-rw-r-- 1 fulano fulano  49999892 Aug 25 14:27 myhugefile_split.part.an
-rw-rw-r-- 1 fulano fulano  50000083 Aug 25 14:32 myhugefile_split.part.ao
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:26 myhugefile_split.txt
-rw-rw-r-- 1 fulano fulano 749888896 Aug 25 12:23 myhugefile.txt
-rwxrwxr-x 1 fulano fulano       537 Aug 25 14:28 split_huge_01.sh
-rw-rw-r-- 1 fulano fulano     16018 Aug 25 14:59 STEPS.txt
$ wc -l myhugefile_grep.txt
15000000 myhugefile_grep.txt
$ wc -l myhugefile_grep.new00
15000000 myhugefile_grep.new00
$
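Each round of this technique repeats the same two steps with different line numbers: cut an excerpt out with sed, then stitch the file back together with head, cat and sed. A minimal sketch that wraps both steps in functions is given here. The names extract_lines and splice_lines are hypothetical, and the commands inside them are the ones used in the session above.

```shell
#!/bin/sh
# extract_lines FILE START END: print lines START..END of FILE and stop
# reading as soon as END has been printed.
extract_lines() {
    sed -n -e "${2},${3}p" -e "${3}q" "$1"
}

# splice_lines FILE START END PORTION: print FILE with lines START..END
# replaced by the contents of PORTION (the edited excerpt).
splice_lines() {
    head -n "$(($2 - 1))" "$1"    # everything before the excerpt
    cat "$4"                      # the edited excerpt
    sed -e "1,${3}d" "$1"         # everything after the excerpt
}
```

The first round above would then read extract_lines myhugefile_grep.txt 3749971 3749991 > myhugefile_grep.portion00, an edit of that small file, and splice_lines myhugefile_grep.txt 3749971 3749991 myhugefile_grep.portion00 > myhugefile_grep.00.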




CONCLUSIONS


  • The main factors at play are: disk space, memory, time, and the installation/configuration of additional software (Ruby, in this case).
  • Technique 1 (Ruby script): the slowest (about 18.5 minutes), needs twice the disk space, and needs a configured Ruby installation.
  • Technique 2 (split into n files): needs twice the disk space, is fast (roughly 20 seconds each for split and cat), and involves manual editing.
  • Technique 3 (grep, head, sed, cat): the commands run fast (about 25 seconds per rebuild), but it is harder to automate, takes more commands and steps, involves manual editing, and needs *more* disk space.
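Since disk space is the constraint shared by all three techniques, it may be worth checking it before starting. The sketch below compares the size of the file with the space available in the current directory. The function name has_room_for_copy and its COPIES parameter are hypothetical; how many copies a given technique really needs depends on how many intermediate files are kept.

```shell
#!/bin/sh
# has_room_for_copy FILE [COPIES]: succeed if the filesystem holding the
# current directory can take COPIES (default 1) extra copies of FILE.
has_room_for_copy() {
    copies=${2:-1}
    # File size in KiB, rounded up.
    size_kb=$(( ($(wc -c < "$1") + 1023) / 1024 ))
    # Fourth column of the second line of POSIX-format df output is the space available, in KiB.
    avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$((size_kb * copies))" ]
}
```

A possible use is has_room_for_copy myhugefile.txt 2 || echo "not enough disk space" before running any of the techniques.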





APPENDICES


  • 'build_huge_02.sh', shell script that builds a huge file.

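#!/bin/sh
# build_huge_02.sh
# Shell script to build a huge text file.
# 2013-08-24
# Usage: ./build_huge_02.sh ITERATIONS > output_file
counter=1
while [ "$counter" -le "$1" ]
do
    # One header line per iteration, followed by the directory listing as filler.
    echo "========== ITERATION $counter =========="
    ls
    counter=$((counter+1))
done
exit 0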
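Because the filler comes from ls, the output of this script depends on the directory it runs in. For small, reproducible trials, a variant with fixed filler lines can help. The sketch below is such a variant, together with the arithmetic that predicts where each header lands. The names build_sample and header_line are hypothetical. The value of 14 filler lines per iteration is inferred from the 15000000 lines produced by 1000000 iterations.

```shell
#!/bin/sh
# build_sample N FILLER: print N iterations, each one a header line
# followed by FILLER fixed lines (instead of the output of ls).
build_sample() {
    counter=1
    while [ "$counter" -le "$1" ]; do
        echo "========== ITERATION $counter =========="
        i=1
        while [ "$i" -le "$2" ]; do
            echo "filler line $i"
            i=$((i + 1))
        done
        counter=$((counter + 1))
    done
}

# header_line K FILLER: line number of the header of iteration K,
# given FILLER filler lines per iteration.
header_line() {
    echo $(( ($1 - 1) * ($2 + 1) + 1 ))
}
```

With 14 filler lines, header_line 250000 14 gives 3749986, which is the line number grep reported for ITERATION 250000.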





  • 'huge_parser_00.rb', Ruby script that parses the input and replaces the matches it finds.

#!/usr/bin/env ruby
# huge_parser_00.rb
# Huge text file RegExp parser.
# 2013-08-25

matchers={
    %q/= ITERATION 250000 =/ => %q/= ITERACION #250000 =/,
    %q/= ITERATION 500000 =/ => %q/= ITERACION #500000 =/,
    %q/= ITERATION 750000 =/ => %q/= ITERACION #750000 =/,
    %q/= ITERATION 1000000 =/ => %q/= ITERACION #1000000 =/
}
matchers.each_pair { |m,r|
    STDERR.puts "%s: %s" % [ m, r ]
}
STDIN.each { |line|
    #STDERR.puts "line=#{line}"
    line.chomp!
    unless matchers.length == 0
        matchers.each_pair { |m,r|
            re=/#{m}/
            next if line[re].nil?
            line.sub!(re,r)
            STDERR.puts "Matched: #{m} -> #{r}"
            matchers.delete(m)
            break
        }
    end
    puts line
}

# Invoked like:
# $ ruby ./huge_parser_00.rb < myhugefile_ruby.txt > myhugefile_ruby.new00





  • 'split_huge_01.sh', shell script that iterates over the file parts of the split technique and searches them.

#!/bin/sh
# split_huge_01.sh
# Shell script to handle huge text file.
# Technique: split.
# 2013-08-25
#split -n 7 myhugefile_split.txt myhugefile_split.part.
#split -l 1000000 myhugefile_split.txt myhugefile_split.part.
for CHUNK in myhugefile_split.part.* ; do
        grep -H -n -E '= ITERATION 250000 =|= ITERATION 500000 =|= ITERATION 750000 =|= ITERATION 1000000 =|= ITERACION #250000 =|= ITERACION #500000 =|= ITERACION #750000 =|= ITERACION #1000000 =' "$CHUNK"
done
#cat myhugefile_split.part.* > myhugefile_split.new00
exit 0
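The line numbers this script prints are relative to each chunk, while the earlier grep runs on the whole file print global ones. The two are related by simple arithmetic, sketched here. The function name locate_in_chunks is hypothetical, and the sketch assumes the default two-letter suffixes of split (aa, ab, ...), so it only covers the first 676 chunks.

```shell
#!/bin/sh
# locate_in_chunks GLOBAL_LINE CHUNK_SIZE: print "SUFFIX:LOCAL_LINE", the
# split suffix of the chunk holding GLOBAL_LINE and the line number inside it.
locate_in_chunks() {
    index=$(( ($1 - 1) / $2 ))          # 0-based chunk number
    local_line=$(( ($1 - 1) % $2 + 1 )) # 1-based line inside that chunk
    letters=abcdefghijklmnopqrstuvwxyz
    first=$(printf '%s' "$letters" | cut -c "$((index / 26 + 1))")
    second=$(printf '%s' "$letters" | cut -c "$((index % 26 + 1))")
    echo "$first$second:$local_line"
}
```

For example, locate_in_chunks 3749986 1000000 gives ad:749986, which agrees with the myhugefile_split.part.ad:749986 line reported by the script.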
