multi-line record grep

Post on 13-Apr-2017


record-oriented grep

mlr-grep

ryo1kato@github @gmail @twitter @facebook

motivation

Want to "grep" multi-line entries in a file

✦ multi-line log files, or *.ini, etc.
✦ semi-structured text like an ifconfig output


for example...

$ cat data.txt
[one]
two
three
[foo]
bar
baz
[hoge]
piyo
huga


Want to extract every line of each record that contains a pattern.

Typical way

✦ grep -A 12 -B 34 -C 56
✦ pcregrep --multiline
✦ awk -v RS='\n\n' "/$re/"
✦ perl -e …


But

✦ pcregrep : You often need a very long regex.
  ✦ Note that this is NOT about finding a multiline pattern (a pattern containing '\n'), but about extracting a multiline record that contains a pattern.
✦ AWK : Possible using RS (needs gawk)
  ✦ Actually, it's difficult to do it right using pcregrep or awk.
✦ perl, python : well, if you go that far ...

But, do you want to write a one-liner / X script for these?

✦ zgrep
✦ grep -c (--count)
✦ grep -i (--ignore-case)
✦ grep -v (--invert-match)
✦ grep --color


So I wrote it for you!

✦ mlr-grep
  ✦ Multi-Line Record Grep
✦ AWK, Haskell, Python
  ✦ named amlgrep, hmlgrep, and pymlgrep
  ✦ They have almost identical features.


e.g.

$ amlgrep 'ba' …
[foo]
bar
baz

A whole record containing the pattern
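The idea can be sketched in a few lines of Python. This is only an illustration of record-oriented matching, not the code of any of the three tools: the function names `split_records` and `grep_records` are made up here, and the sketch assumes that data.txt holds one item per line and that a record starts at each line beginning with '['.

```python
import re

def split_records(lines, rs=r"^\["):
    """Group lines into records; a new record starts at each line matching rs."""
    sep = re.compile(rs)
    records, current = [], []
    for line in lines:
        # A separator line closes the record collected so far (if any).
        if sep.search(line) and current:
            records.append(current)
            current = []
        current.append(line)
    if current:
        records.append(current)
    return records

def grep_records(lines, pattern, rs=r"^\["):
    """Return every record in which at least one line matches pattern."""
    pat = re.compile(pattern)
    return [rec for rec in split_records(lines, rs)
            if any(pat.search(line) for line in rec)]

# Assumed contents of data.txt, one item per line.
data = ["[one]", "two", "three", "[foo]", "bar", "baz", "[hoge]", "piyo", "huga"]

for rec in grep_records(data, "ba"):
    print("\n".join(rec))
```

Run on the sample data, this prints the three lines of the [foo] record, the same result as the amlgrep call above.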

✦ amlgrep - AWK implementation
  ✦ Needs gawk.
  ✦ Fastest
  ✦ --rs regex is slightly broken in RHEL5.
  ✦ Automatically extracts *.gz, *.bz2, and *.xz files
  ✦ --color, --count, --invert-match
  ✦ AND, OR of multiple keywords.

✦ hmlgrep - Haskell implementation
  ✦ Has almost the same feature set as the AWK version.
  ✦ Sometimes 1.5-2x slower on files with short lines and many matches.

✦ pymlgrep - Python implementation
  ✦ Slowest (4x the run time of the AWK version)
  ✦ Doesn't support multiple keywords
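As a rough illustration of two of the listed features (automatic extraction of compressed files, and --count with --invert-match applied to records rather than lines), here is a small Python sketch. The helper names `open_text` and `count_records` are hypothetical, and the sketch assumes that compression is detected from the file suffix alone and that a record starts at each line beginning with '['. The real tools may work differently.

```python
import bz2
import gzip
import lzma
import re

# Map file suffixes to openers; anything else is read as plain text.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}

def open_text(path):
    """Open path for reading text, decompressing according to its suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")
    return open(path, "r")

def count_records(lines, pattern, rs=r"^\[", invert=False):
    """Count records that match pattern (or that do not, when invert is set)."""
    sep, pat = re.compile(rs), re.compile(pattern)
    count, in_record, matched = 0, False, False
    for line in lines:
        # A separator line ends the previous record, so score it now.
        if sep.search(line) and in_record:
            count += (matched != invert)
            matched = False
        in_record = True
        matched = matched or bool(pat.search(line))
    if in_record:
        count += (matched != invert)  # score the final record
    return count

sample = ["[one]", "two", "three", "[foo]", "bar", "baz", "[hoge]", "piyo", "huga"]
print(count_records(sample, "ba"))               # records containing "ba"
print(count_records(sample, "ba", invert=True))  # records without "ba"
```

With a file on disk, `count_records(open_text("data.txt.gz"), "ba")` would count matching records in the decompressed text (the file name is only an example).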


Multiple Keywords


$ amlgrep [--or] h t [FILE]
[one]
two
three
[hoge]
piyo
huga

≒ egrep 'h|t', but with fewer keystrokes.

$ amlgrep --and h t [FILE]
[one]
two
three

≒ egrep 'h.*t|t.*h', but with fewer keystrokes.
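The OR and AND semantics can be sketched as follows. This is an illustration only: `record_matches` is a made-up name, and the sketch assumes that with --and the keywords may appear on different lines of the same record, which the slides do not state explicitly.

```python
import re

def record_matches(record, keywords, mode="or"):
    """True if the record matches any ("or") or all ("and") keyword regexes."""
    text = "\n".join(record)
    hits = [re.search(kw, text) is not None for kw in keywords]
    return all(hits) if mode == "and" else any(hits)

# The records of the sample data.txt, already split at lines starting with '['.
records = [
    ["[one]", "two", "three"],
    ["[foo]", "bar", "baz"],
    ["[hoge]", "piyo", "huga"],
]

or_hits = [r[0] for r in records if record_matches(r, ["h", "t"], "or")]
and_hits = [r[0] for r in records if record_matches(r, ["h", "t"], "and")]
print(or_hits)   # headers of records containing 'h' or 't'
print(and_hits)  # headers of records containing both 'h' and 't'
```

On the sample data, the OR case selects the [one] and [hoge] records and the AND case selects only [one], in line with the two amlgrep examples above.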


--timestamp

For multi-line log files in which each entry begins with a timestamp

$ cat datetime.log
2014-01-23 12:34:56 log 1 foo bar
2014-01-24 12:34:57 log 2 one two
2014-01-25 12:34:58 log 3 hoge piyo


$ amlgrep -t 'one' …
2014-01-24 12:34:57 log 2 one two
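A timestamp-delimited grep could look roughly like this in Python. It is a sketch under stated assumptions, not the tool's implementation: `TIMESTAMP` is a deliberately simplified stand-in that only recognizes ISO-style dates at the start of a line, `grep_log` is a hypothetical name, and the sample log layout (indented continuation lines) is invented for the demonstration.

```python
import re

# Simplified stand-in for --timestamp: an ISO-style date at the start of a line.
TIMESTAMP = re.compile(r"^20[0-9]{2}-[0-9]{2}-[0-9]{2}")

def grep_log(lines, pattern):
    """Return the log entries (lists of lines) that contain pattern."""
    pat = re.compile(pattern)
    entries, current = [], []
    for line in lines:
        # A line starting with a timestamp opens a new entry.
        if TIMESTAMP.search(line) and current:
            entries.append(current)
            current = []
        current.append(line)
    if current:
        entries.append(current)
    return [e for e in entries if any(pat.search(l) for l in e)]

# Hypothetical multi-line log; the exact layout is an assumption.
log = [
    "2014-01-23 12:34:56 log 1",
    "    foo",
    "    bar",
    "2014-01-24 12:34:57 log 2",
    "    one",
    "    two",
    "2014-01-25 12:34:58 log 3",
    "    hoge",
    "    piyo",
]

for entry in grep_log(log, "one"):
    print("\n".join(entry))
```

Here only the entry stamped 2014-01-24 is printed, together with its continuation lines.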


$ amlgrep -t --dump foo

gawk -W re-interval -F \n -v RS='\n(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))' '-v' 'ORS=' 'oldRT $0 ~ /foo/ {i++;if(substr(oldRT,1,1)=="\n"){h=substr(oldRT,2)}else{h=oldRT};;gsub(/foo/,"&",h);print h;gsub(/foo/, "&");print;if(RT != "")printf "\n"} {oldRT=RT} END{if (i>0){exit 0}else{exit 1}}'


Change the record separator

✦ --rs '^$'
  ✦ Empty lines
✦ --rs '^----'
  ✦ Four or more dashes
✦ --rs '^[[:alnum:]]'
  ✦ An alphanumeric character in the first column. (For ifconfig-like output)
✦ --rs '^\['
  ✦ A line beginning with '[' (For *.ini files)
✦ --timestamp
  ≒ --rs '^(((Mon|Tue|Wed|Thu|Fri|Sat),?[ \t]+)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Dec),?[ \t]*[0-9]{1,2},?[ \t][0-2][0-9]:[0-5][0-9](:[0-5][0-9])?(,?[ \t]20[0-9][0-9])?|20[0-9][0-9]-(0[0-9]|11|12)-(0[1-9]|[12][0-9]|3[01]))'
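To show how the choice of separator changes what counts as a record, here is a small Python sketch with a configurable separator regex. The function name `grep_by_separator` and the ifconfig-like sample text are made up for illustration. Python's `re` module has no POSIX classes such as `[[:alnum:]]`, so the sketch substitutes `[A-Za-z0-9]`.

```python
import re

def grep_by_separator(text, pattern, rs):
    """Split text into records at lines matching rs; return records matching pattern."""
    sep = re.compile(rs)
    pat = re.compile(pattern, re.MULTILINE)
    records = []
    for line in text.splitlines():
        # Start a new record at a separator line (or at the very first line).
        if sep.search(line) or not records:
            records.append([])
        records[-1].append(line)
    joined = ["\n".join(r) for r in records]
    return [r for r in joined if pat.search(r)]

# Invented ifconfig-like text: each record starts in the first column.
IFCONFIG_LIKE = """\
eth0: flags=UP
    inet 192.0.2.10
lo: flags=UP,LOOPBACK
    inet 127.0.0.1
"""

INI_LIKE = "[one]\ntwo\nthree\n[foo]\nbar\nbaz\n"

print(grep_by_separator(IFCONFIG_LIKE, r"127\.0\.0\.1", r"^[A-Za-z0-9]"))
print(grep_by_separator(INI_LIKE, "ba", r"^\["))
```

The first call returns the whole `lo` block and the second returns the whole `[foo]` section, each as a single string.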


http://github.com/ryo1kato/mlr-grep
