Detecting Hidden File System Problems
Nicholas P. Cardo
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
[email protected]
The Dilemma
• File system problems can exist
• Sometimes there are no errors
• How to detect a non-error problem
• How to do it quickly
# grep -i lustre messages | grep -i error | wc -l
5836 (6 days)
Words of Wisdom
“Strive for perfection in everything. Take the best that exists and make it better. If it does not exist, create it. Accept nothing nearly right or good enough.”
Sir Henry Royce, co-founder of Rolls-Royce
Useful API Calls
• llapi_ping: ping Lustre components
• llapi_file_create: create a file with stripes
• llapi_file_get_stripe: read stripe info
Wanted: Dead or Alive
hc=/proc/fs/lustre/health_check
for node in $mds $oss
do
    xx=`ssh $node cat $hc`
    if [ "$xx" != "healthy" ]
    then
        echo "$node not healthy"
    fi
done
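The shell loop gets away with a plain string comparison because command substitution strips the trailing newline. A monitoring tool written in C that reads the health file directly has to do that trimming itself. The following is a minimal sketch of that idea; `is_healthy_text` and `file_is_healthy` are hypothetical helper names, not part of the original tool, and only the first line of the file is examined.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the text is exactly "healthy" once trailing whitespace
 * is ignored, 0 otherwise. Hypothetical helper for illustration. */
int is_healthy_text(const char *text)
{
    assert(text != NULL);
    size_t len = strlen(text);
    while (len > 0 && (text[len - 1] == '\n' || text[len - 1] == '\r' ||
                       text[len - 1] == ' '  || text[len - 1] == '\t'))
        len--;
    return len == strlen("healthy") && strncmp(text, "healthy", len) == 0;
}

/* Read the first line of a health file and test it.
 * Returns 1 if healthy, 0 if not healthy, -1 if the file cannot be opened. */
int file_is_healthy(const char *path)
{
    char line[256];
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof(line), fp) == NULL)
        line[0] = '\0';          /* an empty file is treated as not healthy */
    fclose(fp);
    return is_healthy_text(line);
}
```

On a server node this would be called as `file_is_healthy("/proc/fs/lustre/health_check")`; the -1 return keeps "could not read the file" distinct from "read it and it was not healthy".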
OST Ping
[Flowchart: readdir on /proc/fs/lustre/osc → skip ".", "..", and num_refs → entry is DT_DIR? → llapi_ping → rc=0? (N: ping failed)]
How About
dir = opendir("/proc/fs/lustre/osc");
while ((d = readdir(dir)) != NULL) {
    /* skip entries that are not OSC devices */
    if (!strcmp(d->d_name, ".")) continue;
    if (!strcmp(d->d_name, "..")) continue;
    if (!strcmp(d->d_name, "num_refs")) continue;
    /* each remaining directory names one OSC; ping it */
    if (d->d_type == DT_DIR) {
        if ((llrc = llapi_ping("osc", d->d_name)) != 0)
            fprintf(stderr, "problem %s\n", d->d_name);
    }
}
closedir(dir);
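The same scan can be tried on a machine without Lustre by passing the ping in as a callback. The sketch below assumes a hypothetical `ping_all` helper and stand-in callbacks; in a real tool the callback would wrap `llapi_ping("osc", name)`. It also checks the `opendir` result and uses `stat()` instead of `d_type`, since some file systems report `DT_UNKNOWN`.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Callback type: return 0 when the named target answers. */
typedef int (*ping_fn)(const char *name);

/* Ping every subdirectory of root. Returns the number of failed pings,
 * or -1 if root cannot be opened. Hypothetical helper for illustration. */
int ping_all(const char *root, ping_fn ping)
{
    assert(root != NULL && ping != NULL);
    DIR *dir = opendir(root);
    if (dir == NULL)
        return -1;

    int failures = 0;
    struct dirent *d;
    while ((d = readdir(dir)) != NULL) {
        if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, "..") ||
            !strcmp(d->d_name, "num_refs"))
            continue;

        /* stat() works even where d_type is DT_UNKNOWN */
        char path[4096];
        struct stat st;
        snprintf(path, sizeof(path), "%s/%s", root, d->d_name);
        if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
            continue;

        if (ping(d->d_name) != 0) {
            fprintf(stderr, "problem %s\n", d->d_name);
            failures++;
        }
    }
    closedir(dir);
    return failures;
}

/* Stand-in callbacks so the scan can be exercised without Lustre. */
int ping_ok(const char *name)   { (void)name; return 0; }
int ping_fail(const char *name) { (void)name; return -1; }
```

Returning a failure count rather than printing only lets a caller turn the scan into an exit status for a cron job or monitoring probe.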
Benchmarking
“benchmark: to study (as a competitor’s product or business practices) in order to improve the performance of one’s own company.”
www.webster.com
“Benchmarking is a process of comparing one’s business processes and performance metrics to industry bests and/or best practices from other industries. Dimensions typically measured are quality, time and cost, improvements from learning mean doing things better, faster, and cheaper.”
www.wikipedia.org
Nick’s Corollary BM1: benchmarks can make anything look good.
Create File
[Flowchart: llapi_file_create → llapi_file_get_stripe → rc=0? (N: get failed) → stripe cnt match? (N: cnt mismatch) → ost match? (N: ost mismatch)]
Create Striped File
/* default stripe size, start on OST ostnum, one stripe, default pattern */
rc = llapi_file_create(fname, 0, ostnum, 1, 0);
lum = malloc(LOV_EA_MAX(lum));
/* read back the stripe data */
rc = llapi_file_get_stripe(fname, lum);
if (!rc) {
    /* check the stripe count */
    if (lum->lmm_stripe_count != 1) {
        fprintf(stderr, "%s: stripe count mismatch: %s\n", ProgName, fname);
    } else {
        /* check the ost number */
        oi = lum->lmm_objects[lum->lmm_stripe_offset].l_ost_idx;
        if (oi != ostnum) {
            fprintf(stderr, "%s: ost mismatch: %s\n", ProgName, fname);
        }
    }
}
Write Test
[Flowchart: start time → write → size reached? (N: write again) → end time → MB/s]
time(&b_timval); /* start the clock */
for (cnt=0;cnt<fcnt;cnt++) write(ifd,pattern,sizeof(pattern));
time(&e_timval); /* stop the clock */
fsync(ifd); /* flush the cache */
/* calculate write megabytes per second */
wmbs = (float)fcnt * sizeof(pattern) / (float)(e_timval - b_timval) / 1048576;
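Two details of the timing above are worth keeping in mind: `time()` counts whole seconds, so a short run can report zero elapsed time, and the clock is stopped before `fsync`, so the flush itself is not timed. The sketch below shows one way to get sub-second timing, assuming POSIX `clock_gettime` is available; the helper names are hypothetical and not from the original tool.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Current reading of the monotonic clock. */
struct timespec now_monotonic(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* Seconds elapsed between two clock readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Megabytes per second (1 MB = 1048576 bytes).
 * Returns 0.0 when no measurable time elapsed, avoiding a divide by zero. */
double mb_per_second(double bytes, double seconds)
{
    assert(bytes >= 0.0);
    if (seconds <= 0.0)
        return 0.0;
    return bytes / seconds / 1048576.0;
}
```

A write test would call `now_monotonic()` before the loop and again after it (after `fsync` if the flush should count toward the rate), then pass `fcnt * sizeof(pattern)` and the elapsed time to `mb_per_second`.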
Read Test
[Flowchart: start time → read → size reached? (N: read again) → end time → MB/s]
lseek(ifd, 0, SEEK_SET);  /* rewind */
time(&b_timval);          /* start the clock */
/* read pattern-sized records so the byte count matches the rate formula */
for (cnt = 0; cnt < fcnt; cnt++)
    read(ifd, buf, sizeof(pattern));
time(&e_timval);          /* stop the clock */
/* calculate read megabytes per second */
rmbs = (float)fcnt * sizeof(pattern) / (float)(e_timval - b_timval) / 1048576.0;
Data Check
[Flowchart: seek 0 → read → pattern match? (N: pattern mismatch) → size reached? (N: read again)]
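lseek(ifd, 0, SEEK_SET);  /* rewind */
/*
 * repeat the read, but this time check the results
 * to make sure that what we read matches what we wrote
 */
for (cnt = 0; cnt < fcnt; cnt++) {
    /* a short read and a byte-level difference both count as a mismatch;
       memcmp, unlike strncmp, does not stop at a NUL byte */
    if (read(ifd, buf, sizeof(pattern)) != (ssize_t)sizeof(pattern) ||
        memcmp(buf, pattern, sizeof(pattern))) {
        fprintf(stderr, "%s: read/write mismatch: %s\n", ProgName, fname);
        rc = 1;
        break;
    }
}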
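A data check like this is sensitive to two things: `read` may legally return fewer bytes than requested, and a string comparison stops at the first NUL byte. The sketch below is one way to handle both, under the assumption that the file holds back-to-back copies of a fixed pattern; `read_full` and `verify_pattern` are hypothetical names, not the original tool's functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Keep reading until len bytes arrive, EOF is hit, or an error occurs.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n < 0)
            return -1;
        if (n == 0)
            break;                       /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Check that the file holds `count` copies of the pattern.
 * Returns 0 on success, the 1-based number of the first bad record,
 * or -1 if the rewind fails. */
long verify_pattern(int fd, const void *pattern, size_t len, long count)
{
    char buf[4096];
    assert(len > 0 && len <= sizeof(buf));
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;
    for (long rec = 1; rec <= count; rec++) {
        if (read_full(fd, buf, len) != (ssize_t)len ||
            memcmp(buf, pattern, len) != 0) {
            fprintf(stderr, "read/write mismatch at record %ld\n", rec);
            return rec;
        }
    }
    return 0;
}
```

Reporting the record number narrows a mismatch to a file offset, which helps when relating a bad read to a particular stripe or OST.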
Get OSTs
dp = opendir("/proc/fs/lustre/lov");
while ((de = readdir(dp)) != NULL) {
    /* we only want the requested file system */
    if (strncmp(de->d_name, fsp, strlen(fsp))) continue;
    sprintf(rc.fsname, "%s", fsname);
    /* get the number of osts */
    sprintf(pname, "/proc/fs/lustre/lov/%s/numobd", de->d_name);
    ffd = fopen(pname, "r");
    if (ffd != NULL) {   /* skip the read if numobd cannot be opened */
        fscanf(ffd, "%d", &rc.numobd);
        fclose(ffd);
    }
    break;
}
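Because the OST count decides how many tasks the test must run with, it helps to treat a missing or malformed numobd file as an error rather than continue with an unset value. A minimal sketch of a checked read follows; `read_count` is a hypothetical helper, not part of the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Read one non-negative integer from a /proc-style file.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
int read_count(const char *path, int *out)
{
    assert(path != NULL && out != NULL);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    int value;
    int matched = fscanf(fp, "%d", &value);
    fclose(fp);

    if (matched != 1 || value < 0)
        return -1;               /* empty, non-numeric, or negative */
    *out = value;
    return 0;
}
```

In the loop above, the path built in `pname` would be passed to `read_count`, and a -1 result would abort the run before any test files are created.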
In a Nutshell
[Flowchart: MPI_Init → Get OSTs → Scale OK? → Create → Write → Read → Verify]
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Parse the command line options */
while ((optchr = getopt(argc, argv, "f:s:t:")) != EOF) {
    switch (optchr) {
        case 'f': fsname = optarg; break;
        case 's': fsize = atoi(optarg); break;
        case 't': tstdir = optarg; break;
        default : MPI_Abort(MPI_COMM_WORLD, 1);
    }
}
/* the task count must equal the OST count */
fs = getfsinfo(fsname);
if (size != fs->numobd) {
    fprintf(stderr, "%s: tasks(%d) != osts(%d)\n", ProgName, size, fs->numobd);
    MPI_Abort(MPI_COMM_WORLD, 1);
}
/* construct the name of the test file */
sprintf(fname, "%s/testfile.%s.%d", tstdir, fs->fsname, rank);
rc = StripeFile(fname, rank);                  /* create the striped file */
if (!rc) rc = RDWR(fname, rank, fsize, fsname); /* read/write test */
if (!rc) unlink(fname);                        /* delete the test file */
MPI_Finalize();
Case #1
[Chart: x-axis 0 to 50, y-axis 130.0 to 175.0; series: Scratch Write, Scratch Read, Scratch2 Write, Scratch2 Read]
Case #2
[Chart; series: Scratch Write, Scratch Read]
Case #3
[Chart; series: write, read]
Case #3b
[Chart; series: write, read, after write, after read]
Thank You!