

Printer Identification using Supervised Learning for Document Forgery Detection

Sara Elkasrawi
German Research Center for Artificial Intelligence, DFKI GmbH
D-67663 Kaiserslautern, Germany

[email protected]

Faisal Shafait
School of Computer Science and Software Engineering

The University of Western Australia

6009 Crawley, Perth, Australia

[email protected]

Abstract—Identifying the source printer of a document is important in forgery detection. The larger the number of documents to be investigated for forgery, the less time-efficient manual examination becomes. Assuming the document in question was scanned, the accuracy of automatic forgery detection depends on the scanning resolution. Low (100-200 dpi) and common (300-400 dpi) resolution scans have less distinctive features than high-end scanner resolutions, but are more widespread in offices. In this paper, we propose a method to automatically identify source printers using common-resolution scans (400 dpi). Our method depends on distinctive noise produced by printers. Independent of the document content or size, each printer produces noise depending on its printing technique, its brand, and slight differences due to manufacturing imperfections. Experiments were carried out on a set of 400 documents of similar structure printed using 20 different printers. The documents were scanned at 400 dpi using the same scanner. Assuming constant settings of the printer, the overall accuracy of the classification was 76.75%.

I. INTRODUCTION

Despite the integration of computer systems in most offices nowadays, paper-based documents still play an important role in everyday life, ranging from direct-value paper (money) transactions to governmental paperwork, trade deals, insurance papers, and various reports and receipts. Many of these documents do not contain security marks, which allows their forgery using low-cost commercial devices like scanners and printers. Companies and agencies where large amounts of paper-based documents (such as receipts and bills) are processed automatically without proper verification face similar problems. Consequently, forged documents in such cases are handled by the system without investigation, which can result in financial losses. Due to financial and practical constraints, embedding security features in low-security documents is rarely adopted. Examination through sophisticated tools or with high-quality scanners is also impractical for companies processing large amounts of paper.

One way to identify forged documents is to identify the source printer. According to Ali et al. [1] and Gebhardt et al. [2], printers produce noise depending on their printing techniques and brands. Aging mechanical parts also affect the quality of printed documents. Filtering the noise in printed documents and finding relevant features that characterize this noise in low-resolution scans can therefore serve as a key tool for detecting forged documents.

Most document forgery detection approaches fall into two main groups. One tackles printer identification in order to verify whether the document in question was printed by the original printer (see [1], [2], [3], [4], [5] and [6]). The other examines the document for irregularities that might have occurred during modification or fraud ([7], [8] and [9]).

The approach by van Beusekom et al. [3] detects embedded yellow point patterns in a document that are manufacturer-specific characteristics of the printer. These patterns, however, appear only on colored prints. Another method (see [10]) makes use of quasiperiodic banding effects on the printed paper to identify the printer type. This approach is not applicable to text-only documents due to the lack of a wide range of grey-levels. In [1], the authors extended this approach by projecting signals from each letter and applying a classifier to identify the printer. The tests, however, were applied to a limited number of printers (six known printers and one unknown), and the documents contained 40-100 letters, which is approximately ten lines depending on the font. In comparison, our approach is independent of the number of text lines present in the document.

Further approaches for printer recognition have been presented by Mikkilineni et al. [4], where a gray-level co-occurrence matrix is used to obtain texture features, which are then used to classify documents from different printers. This approach depends on scanning at a high resolution of 2400 dpi, requiring high-quality scanners. In our approach we target low-resolution scans, as high-quality scanners are less common.

Several methods for printing technique recognition have been implemented by Schulze et al. [5] that examine the quality of the printed characters, based on the assumption that different printing methods produce different effects on the printed material. Another method, implemented by Schreyer et al. [6], uses discrete cosine transform (DCT) features to recognize whether documents have been printed using an inkjet or laserjet printer, or photocopied. Tests were performed using only one source document, printed on each of the examined printers. A similar approach [2] examines the character edges for high variance in the gray-level and classifies the documents as either laserjet or inkjet using unsupervised methods. That work is based on the assumption that edge roughness is a characteristic of the printer, which is captured by the variance of the grey-levels at the edges of characters.

2014 11th IAPR International Workshop on Document Analysis Systems

978-1-4799-3243-6/14 $31.00 © 2014 IEEE

DOI 10.1109/DAS.2014.48


Detecting document irregularities has been tackled in [7], where document text lines are examined for misalignment to detect fraudulent modification. Similarly, font differences and over-similarity of characters within a document have been examined in [9] to detect forgery.

A system for detecting forgery in camera images based on scanner noise analysis has been implemented by Khanna et al. [11]. The system assumes a unique noise pattern for each scanner brand and selects statistical features of imaging sensor pattern noise. The method shows promising results for detecting forgeries made through a combination of different images. These features are used in our approach for printer identification.

The rest of this paper is organized as follows. In Section II, we present the features extracted from printed areas, followed by Section III, where the details of the experiments are explained. We then present the experimental results in Section IV and conclude in Section V.

II. PRINTER SPECIFIC FEATURE EXTRACTION

To determine features characteristic of each printer, we first separate the printed area from the non-printed area for a closer examination of the printer-generated noise. Our assumption is that printers do not leave any ink marks on the blank areas of the document, hence these areas are not relevant for extracting printer-specific features. This assumption might not hold for defective printers, but most functional printers satisfy this criterion. Furthermore, we assume that all pages have the correct skew and orientation [12] and that non-text elements have been removed from the documents using text/image segmentation [13]. The main reason for considering text-only documents is to focus on examining printed areas originating from vector graphics. Printed half-tone regions have additional factors that influence their printed quality, hence we chose to ignore such regions in this work. Text lines in the printed document are thresholded into background and foreground pixels. The original image is then subtracted from the thresholded image. The difference image represents the noise and is used for extracting features to train a Support Vector Machine. In the following subsections, each step is explained.

A. Text Line Extraction

To examine the printed areas in textual documents, text lines were segmented with the help of Tesseract [14]. A sample of the text line boundaries is presented in Figure 1.
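The paper uses Tesseract for this step; as a minimal stand-in for readers without an OCR engine, a horizontal projection profile can already locate line boundaries on a clean, deskewed binary page. This is a simplified sketch under the paper's skew/orientation assumptions, not the authors' pipeline:

```python
import numpy as np

def find_text_lines(binary_page):
    """Return (top, bottom) row ranges of text lines in a binary image
    where foreground (ink) pixels are 1 and background is 0.
    Simplified projection-profile segmentation; assumes correct skew."""
    profile = binary_page.sum(axis=1)          # ink pixels per row
    in_line = profile > 0
    lines, start = [], None
    for i, flag in enumerate(in_line):
        if flag and start is None:
            start = i                          # a line begins
        elif not flag and start is not None:
            lines.append((start, i - 1))       # the line ends
            start = None
    if start is not None:
        lines.append((start, len(in_line) - 1))
    return lines

# toy page: two "lines" of ink separated by blank rows
page = np.zeros((10, 20), dtype=np.uint8)
page[1:3, 2:18] = 1
page[6:9, 2:18] = 1
print(find_text_lines(page))   # [(1, 2), (6, 8)]
```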

B. Image Filter and Noise Extraction

Image filtering was performed to obtain a clean image from which the original image is subtracted afterwards to obtain noise patterns. Filtering the printed area is done by first calculating Otsu's threshold and then obtaining the median gray-level of the foreground pixels as well as the median gray-level of the background pixels from the original image, using the Otsu-binarized image as a mask. Hence, a clean bi-level image is generated that has the gray-level values of all foreground/background pixels set to the median foreground/background values thus calculated. A sample of a clean bi-level image and its original image is presented in Figures 2a and 2b.
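The filtering step described above can be sketched as follows. This is a numpy-only illustration, not the authors' implementation; `otsu_threshold` is a textbook between-class-variance Otsu, and the toy image is made up:

```python
import numpy as np

def otsu_threshold(img):
    """Otsu's threshold for an 8-bit grayscale image (numpy-only sketch)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = hist.sum()
    grand_mean = np.dot(np.arange(256), hist) / total
    best_t, best_var = 0, -1.0
    cum, cum_mean = 0.0, 0.0
    for t in range(256):
        cum += hist[t]
        cum_mean += t * hist[t]
        if cum == 0 or cum == total:
            continue
        w0 = cum / total                       # weight of class "<= t"
        mu0 = cum_mean / cum                   # mean of class "<= t"
        mu1 = (grand_mean * total - cum_mean) / (total - cum)
        between = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if between > best_var:                 # maximize between-class variance
            best_var, best_t = between, t
    return best_t

def clean_bilevel(img):
    """Replace all foreground/background pixels by their median gray-levels,
    using the Otsu-binarized image as a mask (as described in Sec. II-B)."""
    t = otsu_threshold(img)
    fg = img <= t                              # darker pixels are ink
    clean = np.empty_like(img)
    clean[fg] = np.median(img[fg]).astype(img.dtype)
    clean[~fg] = np.median(img[~fg]).astype(img.dtype)
    return clean

img = np.array([[200, 210,  20],
                [ 30, 200, 210]], dtype=np.uint8)
print(clean_bilevel(img))   # foreground pixels -> 25, background -> 205
```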

To obtain an image representation of the noise, from which features are extracted, the original image is subtracted from the clean image. A sample noise image is presented in Figure 2c.

I_{noise} =
\begin{cases}
I_{clean} - I_{original} & \text{if } I_{clean} \ge I_{original} \\
255 + I_{clean} - I_{original} & \text{otherwise}
\end{cases}
\qquad (1)
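Equation (1) amounts to a wrap-around subtraction that keeps the difference in [0, 255]. A direct sketch (note the paper wraps with 255 rather than 256, so this is not exactly unsigned uint8 subtraction; the toy arrays are made up):

```python
import numpy as np

def noise_image(clean, original):
    """Eq. (1): pixelwise difference with wrap-around so the result
    stays in [0, 255] (the wrap uses 255, the maximum gray-level)."""
    c = clean.astype(np.int16)
    o = original.astype(np.int16)
    return np.where(c >= o, c - o, 255 + c - o).astype(np.uint8)

clean = np.array([[205, 205, 25]], dtype=np.uint8)
orig  = np.array([[200, 210, 20]], dtype=np.uint8)
print(noise_image(clean, orig))   # [[  5 250   5]]
```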

C. Feature Extraction

Feature selection from the noise image is based on the work of Khanna et al. [11], where the authors use statistical features to describe the pattern noise produced by flatbed scanners. The reason for using statistical features is to be independent of image content and size. Pattern noise introduced by scanners is mainly caused by imperfections during the manufacturing process which affect the scanner sensors. In flatbed scanners, an image is translated by a linear sensor array along the length of the scanner, resulting in each row of the scanned image being generated by the same sensors. Thus, the average of all rows gives an estimate of the fixed "row-pattern" [15].

To draw similarities between the scanning process and the two considered printing processes (i.e. inkjet and laserjet), an understanding of the printing process is necessary.

Inkjet printers place very small drops of ink, whose diameter ranges from 50 to 60 microns, on the paper. They have a print-head that moves back and forth horizontally, dissipating ink onto the paper as it moves through the printer. A laserjet printer has a drum that is initially positively charged. When a document is to be printed, a laser beam discharges certain areas on the drum that correspond to the content being printed (e.g. letters). Afterwards, positively charged ink is placed on the drum and is drawn only to the discharged areas. The printing paper is charged negatively to attract ink from the drum and then discharged so that it does not cling to the drum. Finally, the paper is heated to melt the ink onto the paper [16].

Laserjet and inkjet printing processes are similar in that they proceed horizontally. Analogously, scanning also proceeds horizontally. Building on this analogy, feature extraction is done as explained below.

Let I_noise denote an M×N (M rows and N columns) noise image. The averages of the image columns and rows, denoted by I^r_noise and I^c_noise respectively, are calculated as follows:

I^{r}_{noise}(j) = \frac{1}{M} \sum_{i=1}^{M} I_{noise}(i, j); \quad 1 \le j \le N \qquad (2)

I^{c}_{noise}(i) = \frac{1}{N} \sum_{j=1}^{N} I_{noise}(i, j); \quad 1 \le i \le M. \qquad (3)

Note that I^r_noise is N-dimensional and I^c_noise is M-dimensional. The correlation between each row of the noise image and the vector of column averages, denoted by p_row(i) = C(I^r_noise, I_noise(i, ·)), as well as that between each column and the vector of row averages, denoted by p_col(j) = C(I^c_noise, I_noise(·, j)), are calculated.
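Assuming C(·, ·) denotes the Pearson correlation coefficient (the paper does not define it explicitly, though this choice reproduces the worked example in Section II-D), the averages and correlation vectors can be computed as:

```python
import numpy as np

def noise_statistics(I):
    """Row/column averages (Eqs. 2-3) and correlation vectors of an
    M x N noise image; Pearson correlation is assumed for C(., .)."""
    M, N = I.shape
    Ir = I.mean(axis=0)    # Eq. (2): N-dim vector of column averages
    Ic = I.mean(axis=1)    # Eq. (3): M-dim vector of row averages
    # correlation of each row with Ir, and of each column with Ic
    prow = np.array([np.corrcoef(Ir, I[i, :])[0, 1] for i in range(M)])
    pcol = np.array([np.corrcoef(Ic, I[:, j])[0, 1] for j in range(N)])
    return Ir, Ic, prow, pcol

Ir, Ic, prow, pcol = noise_statistics(np.array([[1., 2.], [3., 4.]]))
print(Ir, Ic)   # [2. 3.] [1.5 3.5]
```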

147

Fig. 1: A sample result of line boundaries

Fig. 2: Samples of the original, filtered (clean), and noise images. (a) Scanned original document image. (b) Cleaned bi-level document image after replacing all foreground/background pixels in the original image with their median values. (c) Noise image obtained after subtracting the original image from the cleaned bi-level image.

Fifteen features are selected from the grayscale image representing the printing noise. The mean, standard deviation, skewness and kurtosis of p_row and p_col are the first 8 features. The standard deviation, skewness and kurtosis of I^r_noise and I^c_noise represent features 9 to 14. Feature number 15 is a measure of the relative periodicity between the noise in columns and rows:

f_{15} = \left( 1 - \frac{\frac{1}{N} \sum_{j=1}^{N} p_{col}(j)}{\frac{1}{M} \sum_{i=1}^{M} p_{row}(i)} \right) \times 100. \qquad (4)
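The full 15-dimensional feature vector can be assembled as below. This is a sketch: Pearson correlation is assumed for C, and the ordering (p_row statistics, then p_col statistics, then the vector statistics) is an assumption that matched the worked example in Section II-D when checked numerically:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def printer_noise_features(I):
    """The 15 statistical features of Sec. II-C from a noise image I:
    mean/std/skewness/kurtosis of prow and pcol (features 1-8),
    std/skewness/kurtosis of Ir and Ic (features 9-14), and the
    periodicity measure f15 of Eq. (4)."""
    M, N = I.shape
    Ir = I.mean(axis=0)
    Ic = I.mean(axis=1)
    prow = np.array([np.corrcoef(Ir, I[i, :])[0, 1] for i in range(M)])
    pcol = np.array([np.corrcoef(Ic, I[:, j])[0, 1] for j in range(N)])
    f = [prow.mean(), prow.std(), skew(prow), kurtosis(prow),
         pcol.mean(), pcol.std(), skew(pcol), kurtosis(pcol),
         Ir.std(), skew(Ir), kurtosis(Ir),
         Ic.std(), skew(Ic), kurtosis(Ic),
         (1 - pcol.mean() / prow.mean()) * 100]   # Eq. (4)
    return np.array(f)

rng = np.random.default_rng(1)
feats = printer_noise_features(rng.normal(size=(8, 8)))
print(feats.shape)   # (15,)
```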

After extracting these features, an SVM is trained with different parameters to achieve the best possible classification.

D. Example on Feature Extraction

Consider the following excerpt of a sample noise image:

94 222 252 0 2 218 139 22
16 96 152 156 164 100 40 0
12 80 100 106 106 108 96 46
44 172 234 240 244 242 234 164
44 182 252 2 4 6 6 248
44 176 250 2 2 6 8 6
108 228 2 4 4 6 8 8
18 180 254 2 4 4 6 8

From Equations 2 and 3,
I^r_noise = [47 167 187 64 66 86 67 62] and
I^c_noise = [118 90 81 196 93 61 46 59].

Thus the correlation vector of each row with the mean column noise is
prow = [0.76541908 0.40335458 0.30764977 0.22812077 0.63556492 0.9453724 0.37110938 0.96699924]
and the correlation vector of each column with the mean row noise is
pcol = [-0.03410681 -0.02542542 0.38378662 0.73388494 0.73277127 0.85778067 0.91560419 0.52314518].

Calculating the 15 features for this sample image gives:
[0.57794877 0.27238116 0.40142544 -1.44735627 0.51093008 0.35072175 -0.91527649 -0.75345361 49.6027973 1.75227986 0.20677763 44.40720662 2.18270397 1.83316294 11.59595677]
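The averages in this example can be reproduced with a few lines of numpy. Matching the printed vectors exactly requires truncating the averages to integers, which the authors apparently did (an assumption inferred from the printed values, not stated in the paper):

```python
import numpy as np

I = np.array([[ 94, 222, 252,   0,   2, 218, 139,  22],
              [ 16,  96, 152, 156, 164, 100,  40,   0],
              [ 12,  80, 100, 106, 106, 108,  96,  46],
              [ 44, 172, 234, 240, 244, 242, 234, 164],
              [ 44, 182, 252,   2,   4,   6,   6, 248],
              [ 44, 176, 250,   2,   2,   6,   8,   6],
              [108, 228,   2,   4,   4,   6,   8,   8],
              [ 18, 180, 254,   2,   4,   4,   6,   8]])

Ir = I.sum(axis=0) // 8      # truncated column-wise averages, Eq. (2)
Ic = I.sum(axis=1) // 8      # truncated row-wise averages,   Eq. (3)
print(Ir.tolist())           # [47, 167, 187, 64, 66, 86, 67, 62]
print(Ic.tolist())           # [118, 90, 81, 196, 93, 61, 46, 59]

# first entry of prow: Pearson correlation of row 0 with Ir
prow0 = np.corrcoef(Ir, I[0, :])[0, 1]
print(prow0)                 # ~0.76541908, matching the example
```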

III. EXPERIMENTAL SETUP

In this work, we used a dataset consisting of 400 document images from 20 different printers. Each printer was used to print 20 different pages. Out of the 20 printers, 13 are laserjet printers1 and seven are inkjet printers2.

This dataset is a subset of the dataset developed by Gebhardt et al. [2], containing only the contract document category. A contract consists of horizontal text lines only and

1 (1) laserjet1 = Samsung CLP 500, (2) laserjet2 = Ricoh Aficio MPC2550, (3) laserjet3 = HP LaserJet 4050, (4) laserjet4 = OKI C5600, (5) laserjet5 = HP LaserJet 2200dtn, (6) laserjet6 = Ricoh Aficio MP6001, (7) laserjet7 = HP Color LaserJet 4650dn, (8) laserjet8 = Nashuatec DSC 38 Aficio, (9) laserjet9 = Canon LBP7750 cdb, (10) laserjet10 = Canon iR C2620, (11) laserjet11 = HP LaserJet 4350 or 4250, (12) laserjet12 = HP LaserJet 5, (13) laserjet13 = Epson AcuLaser C1100

2 (1) inkjet1 = Officejet 5610, (2) inkjet2 = Epson Stylus DX 7400, (3) inkjet3 = unknown, (4) inkjet4 = Canon MX850, (5) inkjet5 = Canon MP630, (6) inkjet6 = Canon MP64D, (7) inkjet7 = unknown


of varying page length; however, most documents contain more than 20 text lines. The other two document categories in [2] are invoices and scientific papers. As mentioned earlier, we focused in this work on the printer noise present in text lines only. Therefore, we ignored the other two categories due to the presence of logos, half-tones, tables, and graphics.

For the first experiment, contracts from six printers were manually selected from the dataset mentioned above (120 documents). The six printers were chosen from different brands to minimize the probability of them having similar mechanical parts. Three of these printers, namely the Samsung CLP 500, Ricoh Aficio MPC2550 and HP LaserJet 4050, were laserjet printers, whereas the Officejet 5610, Epson Stylus DX 7400 and Canon MX850 were inkjet printers.

The second experiment was performed in three different settings. The first included documents printed by the laserjet printers only (260 documents), the second included documents printed by the inkjet printers only (140 documents), and the third included the whole dataset (400 documents).

For the evaluation of the classification results, accuracy was chosen as the metric. Accuracy is defined as:

accuracy = \frac{tp + tn}{tp + fp + tn + fn} \qquad (5)

where true positives (tp) denotes the number of correctly classified samples, false positives (fp) the number of wrongly classified samples, true negatives (tn) the number of correctly rejected samples, and false negatives (fn) the number of wrongly rejected samples.

For the evaluation of the classification per class, the precision and recall measures were used:

precision = \frac{tp}{tp + fp} \qquad (6)

and

recall = \frac{tp}{tp + fn}. \qquad (7)
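For concreteness, Equations 5-7 as code (a trivial sketch; the per-class counts in the example are made up):

```python
def accuracy(tp, fp, tn, fn):
    """Eq. (5): fraction of correctly handled samples."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Eq. (6): correct detections among all detections for a class."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (7): correct detections among all true samples of a class."""
    return tp / (tp + fn)

# toy counts for one printer class: 18 of its 20 documents recognized,
# and 2 documents from other printers wrongly attributed to it
print(precision(tp=18, fp=2))   # 0.9
print(recall(tp=18, fn=2))      # 0.9
```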

The SVM training and testing were done using the RapidMiner3 SVM package. The features used for the SVM were normalized prior to training to achieve higher classification accuracy. One SVM was trained for each experiment setting. The kernel function we chose was a polynomial kernel. To select the best parameters for the kernel, a grid search was performed.
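A comparable setup can be written in scikit-learn. This is illustrative only: the paper used RapidMiner, and the synthetic feature data, parameter grid, and split below are assumptions, not the authors' configuration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the 15-dim noise features of 3 printers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 15)) for c in range(3)])
y = np.repeat([0, 1, 2], 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# normalize features, polynomial-kernel SVM, grid search over kernel params
pipe = make_pipeline(StandardScaler(), SVC(kernel="poly"))
grid = GridSearchCV(pipe, {"svc__degree": [2, 3], "svc__C": [1, 10]}, cv=3)
grid.fit(X_tr, y_tr)
print(grid.score(X_te, y_te))   # close to 1.0 on this easy synthetic data
```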

IV. RESULTS

In the first experiment, the overall accuracy was 90%. Values of precision and recall for the different printers are shown in Figure 3. The figures show no significant distinction between the classification of inkjet printers and that of laserjet printers. Misclassification occurred between printers of similar technology. For example, most misclassified documents of the Samsung laserjet printer were classified as Ricoh samples, which is a laserjet printer as well.

The second experiment, on the whole dataset, resulted in an accuracy of 76.75%. Values of precision and recall for

3http://www.rapidminer.com/

Fig. 3: Precision and Recall for the 6-printers dataset

Fig. 4: Precision and Recall for the 20-printers dataset

each printer are shown in Figure 4. Naturally, the accuracy degrades as documents from more printers are included in the classification. The lowest values of precision and recall, namely those of laserjet3 and laserjet10, suggest that laserjet printers might be harder to identify than inkjet printers.

Comparing the results of classifying printers only among the laserjet printers or only among the inkjet printers, we see that inkjet printers are better identified than laserjet printers, with an accuracy of 93.57% (Figure 5). Classification of laserjet-printed documents resulted in an accuracy of 78.46% (Figure 6).

Inkjet printers are more distinguishable than laserjet ones because inkjet printers, as opposed to laserjet printers, produce less sharp edges [2]. This allows more noise to be printed, which carries the distinguishing features of the source printer.


Fig. 5: Precision and Recall for the inkjet printers dataset

Fig. 6: Precision and Recall for the laserjet printers dataset

V. CONCLUSION

Using a dataset of documents from printers of different types and brands, our approach has shown promising results in identifying their source printers. The overall classification accuracy for the whole dataset was 76.75%. For inkjet-printed documents, a classification accuracy of 93.57% was achieved. Our approach performs better in identifying source inkjet printers as opposed to laserjet ones, as documents printed by the former type contain more characteristic noise than those printed by the latter.

Further improvement could be achieved by finer segmentation of the printed area, aiming to improve the noise analysis. Experimenting with documents of different formats (e.g. tabular, graphic) would also be useful for testing our approach.

ACKNOWLEDGMENTS

This research work was partially funded by The University of Western Australia's FECM research grant.

REFERENCES

[1] G. N. Ali, A. K. Mikkilineni, P.-J. Chiang, J. P. Allebach, G. T. Chiu, and E. J. Delp, "Application of principal components analysis and Gaussian mixture models to printer identification," in International Conference on Digital Printing Technologies, vol. 20, 2004, pp. 301-305.

[2] J. Gebhardt, M. Goldstein, F. Shafait, and A. Dengel, "Document authentication using printing technique features and unsupervised anomaly detection," in Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, DC, USA: IEEE Computer Society, 8/2013, pp. 479-483.

[3] J. van Beusekom, F. Shafait, and T. M. Breuel, "Automatic authentication of color laser print-outs using machine identification codes," Pattern Analysis and Applications, vol. 16, no. 4, pp. 663-678, 2013.

[4] A. K. Mikkilineni, P.-J. Chiang, G. N. Ali, G. T.-C. Chiu, J. P. Allebach, and E. J. Delp, "Printer identification based on graylevel co-occurrence features for security and forensic applications," in Security, Steganography, and Watermarking of Multimedia Contents, 2005, pp. 430-440.

[5] C. Schulze, M. Schreyer, A. Stahl, and T. M. Breuel, "Evaluation of graylevel-features for printing technique classification in high-throughput document management systems," in Proc. Int. Workshop on Computational Forensics. Springer, 2008, pp. 35-46.

[6] M. Schreyer, C. Schulze, A. Stahl, and W. Effelsberg, "Intelligent printing technique recognition and photocopy detection for forensic document examination," in Informatiktage, 2009, pp. 39-42.

[7] J. van Beusekom, F. Shafait, and T. M. Breuel, "Text-line examination for document forgery detection," Int. Jour. on Document Analysis and Recognition, vol. 16, no. 2, pp. 189-207, 2013.

[8] J. van Beusekom, F. Shafait, and T. Breuel, "Automatic line orientation measurement for questioned document examination," in Proc. Int. Workshop on Computational Forensics. Springer, 2009, pp. 165-173.

[9] R. Bertrand, P. Gomez-Kromer, O. Ramos Terrades, P. Franco, and J. Ogier, "A system based on intrinsic features for fraudulent document detection," in Proceedings of the 12th International Conference on Document Analysis and Recognition. Washington, DC, USA: IEEE Computer Society, 8/2013.

[10] G. N. Ali, P. Chiang, A. Mikkilineni, J. P. Allebach, G. T.-C. Chiu, and E. J. Delp, "Intrinsic and extrinsic signatures for information hiding and secure printing with electrophotographic devices," in Proceedings of the IS&T's NIP19: International Conference on Digital Printing Technologies, vol. 19, 2003, pp. 511-515.

[11] N. Khanna, G. T. Chiu, J. P. Allebach, and E. J. Delp, "Scanner identification with extension to forgery detection," in Electronic Imaging 2008. International Society for Optics and Photonics, 2008, pp. 68190G-68190G.

[12] J. van Beusekom, F. Shafait, and T. M. Breuel, "Combined orientation and skew detection using geometric text-line modeling," Int. Jour. on Document Analysis and Recognition, vol. 13, no. 2, pp. 79-92, 2010.

[13] S. S. Bukhari, M. I. A. Al-Azawi, F. Shafait, and T. M. Breuel, "Document image segmentation using discriminative learning over connected components," in Int. Workshop on Document Analysis Systems, 2010, pp. 183-190.

[14] R. Smith, "An overview of the Tesseract OCR engine," in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, vol. 2. IEEE, 2007, pp. 629-633.

[15] N. Khanna, A. K. Mikkilineni, and E. J. Delp, "Scanner identification using feature-based processing and analysis," Information Forensics and Security, IEEE Transactions on, vol. 4, no. 1, pp. 123-139, 2009.

[16] M. Schreyer, "Intelligent printing technique recognition and photocopy detection using digital image analysis for forensic document examination," Master's thesis, University of Mannheim, 2008.
