

  • Y. Zhang et al. (Eds.): APWeb 2008, LNCS 4976, pp. 275 – 286, 2008. © Springer-Verlag Berlin Heidelberg 2008

    Exposing Homograph Obfuscation Intentions by Coloring Unicode Strings

    Liu Wenyin, Anthony Y. Fu, and Xiaotie Deng

    Department of Computer Science, City University of Hong Kong, Hong Kong SAR., China {csliuwy, anthony, csdeng}@cityu.edu.hk

    http://antiphishing.cs.cityu.edu.hk

    Abstract. Unicode has become a useful tool for information internationalization, particularly for applications in web links, web pages, and emails. However, many Unicode glyphs look so similar that malicious parties may exploit this feature to deceive readers. In this paper, we propose Unicode string coloring as a promising countermeasure to this emerging threat. A coloring algorithm is designed and prototyped to assign colors to a set of required languages/scripts such that each language/script is displayed in a unique color, while the color difference among different languages is maximized. Based on that, we propose both fixed and adaptive coloring schemes to render Unicode strings in weblinks and documents so as to distinguish mixed Unicode characters from different language/script groups and vividly illustrate potential homograph obfuscation intentions. Our user study shows that coloring helps alert end users to suspiciously composed strings.

    1 Introduction

    The Universal Character Set (UCS) is a union of the characters/symbols of most languages in the world. It is increasingly popular and important in daily life: we use it to compose web links, web pages, and emails. However, there are many similar characters in UCS, as shown in Figure 1. This can cause a severe web security problem. Malicious people can use various similar characters from different languages to mimic “citibank”, as shown in Figure 2. In a real case, a second “paypal.com” (in which the second ‘a’ is U+0430 in Unicode) was successfully registered in 2005. Unwary users (and even expert users) can easily be spoofed by such a phishing scam into exposing security-sensitive information, such as credit card numbers and passwords. This attack is referred to as a homograph attack [4] and can also be extended to webpages and emails, in which similar characters can be used to generate content that reads the same to human users but evades content-based phishing detection and spam filtering. In this paper, we address this problem and propose a method based on coloring to differentiate “abnormal” weblinks, webpages, and emails from the relatively “normal” ones.

    We briefly suggested in [3] coloring Unicode strings to help end users spot these suspicious strings, and thereby to reduce the threat of this kind of phishing attack. In this paper, we fully explore this idea and propose a fixed coloring scheme and an adaptive coloring scheme. Both schemes are based on the same coloring algorithm we proposed, which selects colors from a set of available colors and assigns them to a set of required languages/scripts such that each language/script is displayed in a unique color, while the color difference among different languages is maximized. As the characters in Unicode belong to different language regions [6], a straightforward idea is to assign an identical color to the characters in each language region, as done in the Quero Toolbar [10]. However, our study shows that there are similar characters even within the same language, as shown in Figure 1(a). Moreover, some languages share the same set of characters, e.g., basic Latin characters are used by French, German, Dutch, etc. Hence, we first analyze the UCS and design a grouping mechanism to classify the characters, and then assign each group of characters one specific color. The additional color property can provide users with more information for understanding Internationalized Resource Identifiers (IRIs) or document context semantics. We prototyped the proposed coloring scheme and conducted a user study on it. The user study shows that well-designed coloring algorithms can better warn users against homograph obfuscation.

    (a) Latin characters: A (U+0041), Ā (U+0100), Ă (U+0102), Ą (U+0104)

    (b) CJK characters: 行 (U+884C), 行 (U+FA08)

    (c) CJK characters: 银 (U+94F6), 銀 (U+9280), 锒 (U+9512), 鋃 (U+92C3)

    Fig. 1. Examples of visually similar characters in Unicode

    The rest of this paper is organized as follows. Section 2 introduces related work. In Section 3, we discuss the Unicode character coloring issues, which include character grouping, coloring palette construction, and the two coloring schemes. Section 4 demonstrates our prototype tool, Unicode String Illustrator. Section 5 presents our user study, which shows the effectiveness of the proposed coloring schemes. Finally, we conclude this paper in Section 6 and discuss future work in Section 7.

    2 Related Work

    2.1 Anti-phishing

    The Web facilitates our daily lives, but also causes many security problems, including DoS, worms, DNS poisoning, and router attacks. In recent years, phishing has emerged very quickly as a new but serious threat. It was reported that U.S. consumers lost 630 million US dollars over the past two years to fraudulent phishing e-mail scams, cf. Consumer Reports [12]. The amount of money lost to online banking fraud in the UK increased 55 percent to £22.5m in the first half of 2006, according to figures from the banking industry body Apacs [11]. As a consequence, phishing has become a hot topic in both the computer security community and the law enforcement community.

    The most frequently used anti-phishing strategies focus on toolbars or extensions of Web browsers (e.g., Firefox and MS IE7). These can be classified into five main categories by methodology.

    (1) Black/white list. Representative black/white-list-based systems include PhishTank SiteChecker, Google Safe Browsing, FirePhish, and CallingID Link Advisor. The biggest challenge of this method is maintaining the black/white list. Maintaining the black list requires help from the user community. A typical way is to collect phishing reports from users and then have them processed by anti-phishing analysts (employed or volunteer). The drawback is that not all phishing weblinks will be reported and not all anti-phishing analysts are reliable and professional.

    (2) Reputation scoring. This technique uses reputation scores either reported by the anti-phishing community or computed from the given webpages, e.g., WOT and iTrustPage. The reliability of the reputation scoring algorithm is therefore crucial for this technique.

    (3) Malware detection. Malware is not phishing, but it can be used to assist phishing. As anti-phishing technologies develop, ordinary and older phishing methods stop working, and more phishers may turn to malware for assistance. A representative product is Finjan.

    (4) Relevant domain name suggestion. This technique suggests the most relevant legitimate domain names to users when they are accessing the web, e.g., SpoofStick. The biggest challenge is to recognize the webpage being accessed and make reasonable suggestions.

    (5) Personalized visual indicator. This technique is like attaching a personalized sticky note to a legitimate website, so that we can recognize the site when we access it again. A representative product is TrustBar. Such a system relies on the assumption that users can assign indicators correctly to real websites and always keep in mind that webpages without such an indicator are potential phishing pages.

    These toolbar-based strategies are end-user-level solutions only and rely heavily on the end users’ education level, experience, and vigilance. However, some researchers have shown that security toolbars do not effectively prevent phishing attacks [14]. In addition, they inconvenience end users, especially through too many false alarms. In [1], Liu et al. proposed an active and more comprehensive anti-phishing strategy, which determines suspicious links from emails (at end users’ email readers/senders, or enterprise email servers, e.g., bounced emails) and all possible cousin domain names. The webpages at the suspicious links are retrieved and compared with the protected webpages, and visually similar webpages are reported. In particular, enterprise users can use this strategy for early detection of possible phishing webpages and prepare themselves to clear potential security threats in their e-commerce environments without inconveniencing any end user.

    2.2 Homograph Obfuscation

    Gabrilovich et al. demonstrated with homograph attacks [4] that visually identical weblinks can be created with Internationalized Domain Names (IDNs). Phishers can simply find similar characters in UCS to replace certain ones in a legitimate web link to carry out the attack. Major Web browsers, such as Microsoft IE7 and Firefox, can display an IDN in Punycode (an ASCII encoding of Unicode) in the address bar. Punycode exists to make IDNs compatible with ASCII-based DNS; it is an ASCII-based encoding form rather than a user-understandable form, and it is very difficult for a human to understand the meaning of a Punycode string. Many users therefore prefer to display IDNs in their own language scripts. A homograph attack is one kind of Unicode attack [3]; Unicode attacks, however, cover not only visually similar web links but also semantically similar web links, as shown in [13]. Two tools, IRI/IDN SecuChecker [3] and REGAP [2], have been developed to fight against Unicode attacks. Such Unicode attack detection tools can help ICANN [5] and DNS registrars detect malicious registration applications.
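    To make the readability problem of Punycode concrete, the following minimal Python sketch converts a spoofed name into its ASCII-compatible form. It relies on Python's built-in `idna` codec (IDNA 2003), which is our choice for illustration and is not part of the tools discussed above; the spoofed name reuses the “paypal.com” case from Section 1.

```python
# Latin "paypal.com" versus a look-alike whose second 'a' is
# CYRILLIC SMALL LETTER A (U+0430).
genuine = "paypal.com"
spoofed = "payp\u0430l.com"

# The two strings may render identically, yet they are different code point sequences.
print(genuine == spoofed)   # False

# The built-in "idna" codec encodes each non-ASCII label as "xn--" + Punycode.
ace = spoofed.encode("idna")
print(ace)                  # ASCII-only bytes; the first label starts with b"xn--"
print(ace.decode("idna"))   # decodes back to the Unicode form
```

    The ASCII form reveals that the name is not plain Latin, but it tells a human reader nothing about which character was replaced, which is the gap that coloring is meant to fill.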

    2.3 Coloring Techniques

    The standard color space is represented with the RGB system. Other color systems (such as non-linear R'G'B', HSV, CIE L*a*b*, CIE L*u*v*, YCbCr, YIQ, and YUV) have direct mappings to the RGB system. Each color system has its advantage for a specific representation. Although non-linear R'G'B', YUV, and CIE L*u*v* are considered better for differentiating colors, Riemersma’s investigation [8] shows that all of them have disadvantages and proposes an approximate color distance formula based on the RGB system, given in Eq. (1). Since about one in 12 people in the world has a color deficiency, we need to adopt a color-safe strategy in our Unicode coloring scheme, and we found that the web-safe color palette proposed by BT (British Telecommunications) [7] matches this requirement well.

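    \bar{r} = \frac{C_{1,R} + C_{2,R}}{2}

    \Delta R = C_{1,R} - C_{2,R}, \quad \Delta G = C_{1,G} - C_{2,G}, \quad \Delta B = C_{1,B} - C_{2,B}

    \Delta C = \sqrt{\left(2 + \frac{\bar{r}}{256}\right) \Delta R^2 + 4\,\Delta G^2 + \left(2 + \frac{255 - \bar{r}}{256}\right) \Delta B^2}    (1)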
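    As a sanity check on Eq. (1), the distance can be computed directly. The sketch below is a minimal Python rendering of the formula; the function name `color_distance` and the tuple representation of colors are our own choices for illustration. The distance between white and black truncates to 764, which agrees with the MinDistance value listed for black in Figure 3(b).

```python
import math

def color_distance(c1, c2):
    """Approximate distance between two (R, G, B) tuples, following Eq. (1)."""
    r_mean = (c1[0] + c2[0]) / 2
    d_r = c1[0] - c2[0]
    d_g = c1[1] - c2[1]
    d_b = c1[2] - c2[2]
    return math.sqrt(
        (2 + r_mean / 256) * d_r ** 2
        + 4 * d_g ** 2
        + (2 + (255 - r_mean) / 256) * d_b ** 2
    )

white, black = (255, 255, 255), (0, 0, 0)
print(int(color_distance(white, black)))  # 764
```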

    3 Unicode String Coloring

    3.1 Criteria for Coloring Scheme Design

    Coloring Unicode strings is not as straightforward as assigning colors randomly to each language/script type. A good design should have the following features.


    Differentiability

    Visually similar characters should be distinguishable from one another in the same context. A suitable color distance is needed to measure how well humans can differentiate colors. We can use such a measure to generate a reasonable color palette. To accommodate color-blind users, we can use a “color-blindness safe palette” as the candidate color set. In addition, we should construct a color palette with as many colors as possible.

    Scalability

    UCS is a developing and growing repertoire, which means the Unicode consortium [6] is continuously adding more scripts to UCS to make it more “complete” in each new version. Hence, the scalability of the coloring scheme is important. We indeed hope the proposed coloring scheme is still workable after the Unicode consortium publishes new versions of UCS.

    Readability/Usability

    The coloring scheme is meant to vividly show the mixture of different languages in a given context. We need to make the coloring result comfortable for users: images that are too colorful or too high in contrast can make readers uneasy. Therefore, one major concern of the coloring scheme is readability/usability.

    3.2 Unicode Character Grouping

    We should first divide all the Unicode characters into different language/script groups. Phishers may carry out a Unicode attack by mixing different language scripts in the same context. “www.citibank.com” is an example combining both Basic Latin and Cyrillic symbols: the highlighted “c” in this IDN is Cyrillic while the other characters are Basic Latin. Hence, the motivation of Unicode grouping is to list all of the elementary groups to which we should assign colors, such that we can finally determine how many and which languages/scripts are used in a particular Unicode string.
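    The idea of determining which scripts occur in a string can be sketched in a few lines of Python. This is only a rough illustration: it approximates the script of a letter by the first word of its Unicode character name, which is a heuristic of ours and not the grouping used in this paper, and the position of the Cyrillic letter in the spoofed string is chosen arbitrarily.

```python
import unicodedata

def script_group(ch):
    """Rough script label for one character (a heuristic, not the Unicode Script property)."""
    if not ch.isalpha():
        return "COMMON"  # digits, dots, hyphens and other shared symbols
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def groups_in(text):
    """Set of script labels used by the letters of `text`."""
    return {script_group(ch) for ch in text} - {"COMMON"}

genuine = "www.citibank.com"
# First 'c' of "citibank" replaced by CYRILLIC SMALL LETTER ES (U+0441).
spoofed = "www.\u0441itibank.com"

print(groups_in(genuine))  # {'LATIN'}
print(groups_in(spoofed))  # contains both 'LATIN' and 'CYRILLIC'
```

    A string whose letters fall into more than one group is exactly the kind of string the coloring schemes are meant to expose.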

    It is reasonable to display characters of different languages/scripts in different colors. The Unicode consortium classifies the Unicode characters into 11 region-based groups and 122 subgroups based on different language scripts. We can assign one color to each subgroup. The complete list is available at [13]. However, these subgroups do not represent the language/script differences sufficiently. For instance, when we want to differentiate Simplified Chinese, Traditional Chinese, Japanese, and Korean in a Unicode string, we find that some of the characters are actually used by two or more of these languages. According to [6], the Unicode consortium merges all of the Chinese-style characters into CJK Ideographs. If we simply present CJK Ideographs in a single color, we cannot differentiate these four scripts. Therefore we need more specific subgroups to replace CJK Ideographs. We denote the four types of scripts as S, T, J, and K, and generate 15 combinations to form subgroups of CJK Ideographs: S, T, J, K, ST, SJ, SK, TJ, TK, JK, STJ, STK, SJK, TJK, STJK. For example, ST denotes the group of characters used only in Simplified Chinese and Traditional Chinese but not in Japanese or Korean.
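    The count of 15 is simply the number of non-empty subsets of the four scripts, 2^4 − 1. A short Python sketch (variable names are ours) enumerates them in the same order as listed above:

```python
from itertools import combinations

scripts = "STJK"  # Simplified Chinese, Traditional Chinese, Japanese, Korean

# Every non-empty subset of the four scripts: C(4,1) + C(4,2) + C(4,3) + C(4,4) = 15.
subgroups = ["".join(c)
             for r in range(1, len(scripts) + 1)
             for c in combinations(scripts, r)]

print(len(subgroups))  # 15
print(subgroups)
```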


    3.3 Coloring Palette Construction

    Before we start the coloring process, we need to know how many and which colors are available. This set of colors is referred to as the color palette. It should contain as many differentiable colors as possible. We also design the palette to be color-blindness safe, because 5% to 8% of men and 0.5% of women in the world are born colorblind (we limit the discussion to protans (red weak) and deutans (green weak) because they make up 99% of this group). The department for older and disabled customers of British Telecom (BT) proposed a list of safe web colors for color-deficient vision in [7], which consists of 216 colors. We can simply choose from these colors to generate our coloring palette. In order to provide the best differentiability (as discussed in Section 3.1), we have to rank the available colors in the palette from the most distinguishable to the least, and we can use those whose pairwise distances are greater than a threshold. This problem is equivalent to the Maximum Clique Problem [9], which is NP-complete. To simplify the computation, we use a greedy algorithm. (It is worth noting that a greedy algorithm does not find the globally optimal solution in most cases; however, our experiment with an approximate optimization algorithm [15] shows results similar to those of the greedy algorithm but with worse computational performance.) In the greedy algorithm, we first choose the color that is the most distant from the background and foreground, and then repeatedly choose the color that is the most distant from those already chosen. Finally, we obtain the color palette with the colors ranked in descending order of preference. Both the algorithm and the coloring palette are available at [13]. Figure 3 shows some samples from the coloring palette. The text in each color (e.g., in the nth row) shows its RGB value and the minimum color distance among all of the first n colors. E.g., “0 255 51 MinDistance=516” (the green color in the 4th row) means that the minimum distance among the first four colors is 516. The first color in the list has no previous color to be compared with, so its minimum distance is considered infinite.
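    The greedy selection described above can be sketched as follows. This is an illustrative Python reading of the description, not the implementation published at [13]: we assume that the 216 candidates are the usual web-safe colors (six levels per channel), that “most distant from the chosen ones” means maximizing the smallest distance to any chosen color, and that distances follow Eq. (1). The names `greedy_palette` and `CANDIDATES` are ours.

```python
import math
from itertools import product

def color_distance(c1, c2):
    """RGB-based distance of Eq. (1)."""
    r_mean = (c1[0] + c2[0]) / 2
    d_r, d_g, d_b = (a - b for a, b in zip(c1, c2))
    return math.sqrt((2 + r_mean / 256) * d_r ** 2
                     + 4 * d_g ** 2
                     + (2 + (255 - r_mean) / 256) * d_b ** 2)

# Assumption: the candidates are the 216 "web-safe" colours (six levels per channel).
LEVELS = (0, 51, 102, 153, 204, 255)
CANDIDATES = list(product(LEVELS, repeat=3))

def greedy_palette(background, foreground, size):
    """Start from background and foreground, then repeatedly add the candidate
    whose smallest distance to the colours chosen so far is largest."""
    chosen = [background, foreground]
    while len(chosen) < size:
        remaining = [c for c in CANDIDATES if c not in chosen]
        best = max(remaining,
                   key=lambda c: min(color_distance(c, p) for p in chosen))
        chosen.append(best)
    return chosen

palette = greedy_palette((255, 255, 255), (0, 0, 0), size=6)
for n in range(2, len(palette) + 1):
    first_n = palette[:n]
    # Minimum pairwise distance among the first n colours, as in Figure 3.
    min_dist = min(color_distance(a, b)
                   for i, a in enumerate(first_n) for b in first_n[i + 1:])
    print(palette[n - 1], "MinDistance=%d" % min_dist)
```

    Under these assumptions, the first two colors selected after white and black are (204, 0, 255) and (0, 255, 51), with truncated minimum distances of 517 and 516, as in Figure 3(b). Later entries depend on the exact candidate set and tie-breaking and may differ from the published palette.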

    3.4 Unicode String Coloring

    Given a color palette, the Unicode string coloring algorithm assigns colors in the palette to each language/script subgroup of Unicode, following the criteria in Section 3.1 for the best user experience. We propose two coloring schemes: the Fixed Coloring Scheme and the Adaptive Coloring Scheme.

    Fixed Coloring Scheme

    In the fixed coloring scheme, each language/script subgroup is assigned a fixed color. Therefore, users will form a straightforward association between a language and its dominant color, e.g., red for Chinese. The fixed coloring scheme is suitable for coloring IDNs and IRIs. In addition, we can color the ASCII letters using the main foreground color (e.g., black) such that highly reputable websites, such as www.hsbc.com and www.citibank.com, are displayed as they are. It is also acceptable to maintain a white list of domain names and leave the URLs in the white list uncolored (or color them using only the main foreground color). Figure 2 shows some examples of colored “citibank” using the fixed coloring scheme. In these examples, we use UC-SimList_v0.9 [3], which is now referred to as the Similar Unicode Character Index (SUCI), to find characters similar to those in “citibank”. UC-SimList_v0.9 contains all of the visually similar character groups with 90% or higher similarity according to the pixel-overlapping assessment method.

    Fig. 2. Examples of the fixed coloring scheme’s results
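    A fixed coloring of an IRI might be rendered along the following lines. The sketch is hypothetical throughout: the color table `FIXED_COLORS`, the fallback color, the HTML `<span>` output, and the name-based script heuristic are all assumptions made for illustration; the actual group-to-color assignment is the one published at [13].

```python
import html
import unicodedata

# Hypothetical fixed colour table; ASCII/Latin keeps the main foreground colour (black).
FIXED_COLORS = {"LATIN": "#000000", "CYRILLIC": "#CC00FF", "GREEK": "#00FF33"}
DEFAULT_COLOR = "#FF6600"  # fallback for groups missing from the table

def script_group(ch):
    """Rough script label: first word of the Unicode character name (heuristic)."""
    if not ch.isalpha():
        return "LATIN"  # treat digits and punctuation like the main foreground group
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def color_iri(iri):
    """Wrap each character in a <span> carrying the fixed colour of its group."""
    parts = []
    for ch in iri:
        color = FIXED_COLORS.get(script_group(ch), DEFAULT_COLOR)
        parts.append('<span style="color:%s">%s</span>' % (color, html.escape(ch)))
    return "".join(parts)

# "citi" with its first letter replaced by CYRILLIC SMALL LETTER ES (U+0441).
print(color_iri("\u0441iti"))
```

    Because the table is fixed, a pure-ASCII address is rendered entirely in the foreground color, while any foreign look-alike stands out in its own color.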

    Adaptive Coloring Scheme

    Sometimes, we may want to be aware of the diversity of language/script usage under a given background and foreground color, without caring about the association between languages/scripts and colors. In that case, we can take the given background and foreground colors as a starting point, calculate the remaining colors to generate an adaptive coloring palette in real time with the algorithm in [13], and select as many colors as needed from the palette.

    We can use it to illustrate the diversity of language/script usage in webpage blocks or document context. Figure 3 demonstrates some examples colored by the adaptive coloring scheme; the diversity of language/script usage becomes apparent after coloring. Suppose we have one Unicode string composed of three languages, as shown in Figure 3(a) (white is the background color and black the foreground color; the most frequently used language/script is assigned the foreground color). The result after rendering is shown in Figure 3(b). The adaptive coloring palette is in the column on the right side. Figure 3(c) and Figure 3(d) show more results using different color palettes.

    (a) Original context

    Panel                            RGB           MinDistance
    (b) BG=White, FG=Black           255 255 255   -
                                     0 0 0         764
                                     204 0 255     517
                                     0 255 51      516
                                     255 102 0     431
                                     0 153 255     407
    (c) BG=Black, FG=Yellow          0 0 0         -
                                     255 255 0     649
                                     0 204 255     579
                                     255 0 255     569
                                     0 204 0       408
                                     0 0 255       402
    (d) BG=LightYellow, FG=Red       255 255 204   -
                                     255 0 0       585
                                     0 51 255      579
                                     0 204 0       526
                                     0 255 255     408
                                     0 0 0         402

    Fig. 3. Example results of the adaptive coloring scheme
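    The assignment step of the adaptive scheme, in which the most frequently used language/script receives the foreground color and the remaining groups take the following palette entries, can be sketched as below. The function name, the use of one group label per character as input, and the example labels are assumptions for illustration; the palette values are those listed for Figure 3(b).

```python
from collections import Counter

def assign_adaptive_colors(groups, palette):
    """Map each group label to a palette colour, most frequent group first.

    `groups` holds one label per character; `palette` is ordered as
    [background, foreground, 3rd colour, ...], so palette[1] goes to the dominant group.
    """
    ranked = [g for g, _ in Counter(groups).most_common()]
    if len(ranked) > len(palette) - 1:
        raise ValueError("palette too small for the number of groups")
    return {g: palette[i + 1] for i, g in enumerate(ranked)}

# Palette of Figure 3(b): white background, black foreground, then the ranked colours.
palette = [(255, 255, 255), (0, 0, 0), (204, 0, 255),
           (0, 255, 51), (255, 102, 0), (0, 153, 255)]
groups = ["LATIN"] * 6 + ["CJK"] * 3 + ["CYRILLIC"]
print(assign_adaptive_colors(groups, palette))
```

    Only as many colors as there are groups in the text are consumed, so a passage in a single script stays entirely in the foreground color.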

    4 Unicode String Coloring Tool

    We prototyped the Unicode string coloring scheme as a tool called Unicode String Illustrator. It contains two parts: IRI Illustrator (the fixed coloring scheme) and Context Illustrator (the adaptive coloring scheme).

    4.1 IRI Illustrator

    IRI Illustrator is a web browser plug-in that replaces the function of the address bar. When users access phishing IRIs (by clicking or copy-and-pasting), the language/script of each character is shown in a different color.

    4.2 Context Illustrator

    Context Illustrator is a tool that presents the language/script diversity in a context. The background color and foreground color are configurable. The coloring palette is automatically generated using the greedy algorithm. Figure 4 shows the coloring results of the Context Illustrator processing English, Chinese, and Japanese context with different background and foreground configurations.

    Fig. 4. Demo of the Context Illustrator: (a) English; (b1) Simplified Chinese (FG: Black, BG: White); (b2) Simplified Chinese (FG: Yellow, BG: Black); (c) Japanese

    We can see that both the IRI Illustrator and the Context Illustrator render Unicode strings in a style that makes character-based obfuscation easy to observe.

    5 User Study

    We conducted a user study to show the effectiveness of our coloring scheme. It consists of two parts: Part I tests the effect of Unicode coloring on the subjects’ ability to recognize different languages, and Part II compares the usability of the coloring approaches. Three questionnaires (QN1, QN2, and QN3) are used for both parts, and all questionnaires are available at [13]. QN1 tests the white/black approach (WBA), in which characters are always black and the background is always white. QN2 tests the random coloring approach (RA), in which we assign a random color to each language/script group and display the characters in their assigned colors. QN3 tests our fixed and adaptive coloring approach (FAA), which is the coloring approach presented in Section 3 and demonstrated in Section 4.


    We recruited 15 subjects. None of them had previous knowledge of Unicode coloring and none of them was color-blind. The subjects were divided into three teams of five, and each team took one of the questionnaires.

    Part I, Language/Script Usage Understanding

    We list a set of 11 Unicode strings for the subjects to read. Each string is composed of one or more languages/scripts. The subjects are asked to count the number of languages/scripts used in each Unicode string.

    Fig. 5. Experiment result of Part I with QN1, QN2 and QN3 (x-axis: Question No., 1 to 11; y-axis: Correct Answer No., 0 to 5; series: QN1, QN2, QN3)

    Fig. 6(a). Usability comparison result of WBA and RA (x-axis: Question No., 1 to 7; y-axis: No. of Supportive Subject(s), 0 to 5; series: WBA is Better, No Difference, RA is Better)

    Fig. 6(b). Usability comparison result of WBA and FAA (x-axis: Question No., 1 to 7; y-axis: No. of Supportive Subject(s), 0 to 5; series: WBA is Better, No Difference, FAA is Better)

    Fig. 6(c). Usability comparison result of RA and FAA (x-axis: Question No., 1 to 7; y-axis: No. of Supportive Subject(s), 0 to 5; series: RA is Better, No Difference, FAA is Better)

    Figure 5 shows the experiment result. Among the 11 (number of Unicode strings) × 5 (number of subjects) = 55 answers per questionnaire, 11 were correct for QN1, 27 for QN2, and 29 for QN3. In all questions, subjects of QN2 and QN3 did much better than those of QN1, and subjects of QN3 did slightly better than those of QN2. Hence, we conclude that coloring (RA and FAA) is much better than non-coloring (WBA) for users’ understanding of language/script usage. We also conclude that FAA performs similarly to RA, but slightly better. Our explanation is that we did not use many colors in any of the 11 questions, so even a random selection of colors can give a good result; hence, FAA cannot show much advantage over RA. However, when a Unicode string contains many colors, we expect FAA to perform much better than RA.
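    Expressed as proportions of the 55 answers per questionnaire, the counts above correspond to the following rates. This is plain arithmetic on the reported numbers; the dictionary labels are ours.

```python
total_answers = 11 * 5  # 11 Unicode strings x 5 subjects per questionnaire
correct = {"QN1 (WBA)": 11, "QN2 (RA)": 27, "QN3 (FAA)": 29}

for name, n in correct.items():
    print("%s: %d/%d = %.1f%%" % (name, n, total_answers, 100 * n / total_answers))
# QN1 (WBA): 11/55 = 20.0%
# QN2 (RA): 27/55 = 49.1%
# QN3 (FAA): 29/55 = 52.7%
```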


    Part II, Usability Comparison for Different Coloring Approaches

    We ask the subjects to compare the usability of different coloring approaches. QN1 compares WBA with RA, QN2 compares WBA with FAA, and QN3 compares RA with FAA.

    Figure 6 shows the experiment result. Figure 6(a) shows that WBA is rated relatively similar to RA. The reason could be that RA considers neither the background nor the colors already used, so it may assign a color similar to that of another language/script or of the background. Figure 6(b) shows that FAA is much better than WBA, and Figure 6(c) shows that FAA is better than RA. Hence, FAA is better than both WBA and RA. We therefore conclude that a well-designed coloring approach can help end users better understand language/script usage in a Unicode string.

    6 Conclusion

    In this paper, we proposed Unicode string coloring as a solution to character-level obfuscation. We proposed two coloring schemes: fixed coloring and adaptive coloring. The fixed coloring scheme assigns a specific color to each language/script as its basic color to satisfy the security requirement and is therefore a good match for web address coloring. The adaptive coloring scheme calculates a coloring palette in real time based on the given background and foreground colors and the language/script composition. The purpose of this coloring scheme is to provide an easy way for users to understand the language/script diversity in a Unicode string context. We prototyped the two coloring schemes in the Unicode String Illustrator, which contains the IRI Illustrator and the Context Illustrator. Our user study shows that, although coloring is generally useful for showing the mixture of different languages/scripts and for revealing obfuscation intentions, only well-designed coloring schemes maintain high usability and effectiveness for reading and understanding the colored Unicode strings.

    7 Future Work

    We have demonstrated that Unicode string coloring can help illustrate language/script usage in IRIs and text paragraphs. However, it cannot illustrate the phishing intention of strings such as “www.citi-bank.com”. We plan to add this coloring scheme to other phishing detection systems, such as REGAP [2], to make detection of this kind of phishing intention possible. In Unicode String Illustrator, Ver. 1.0, we do not use all of the 15 subgroups of CJK; we only consider two subgroups: Simplified Chinese and the remaining CJK characters. We leave more thorough grouping to future work. In fact, languages/scripts that do not have similar characters can share the same color, and reducing the number of colors in this way is also of future interest.

    Acknowledgement

    The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907] and the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB317002. We would like to thank Yeung Wan Hang, Chau Kin Man, and Mak Sheung Man for their help in the experiments and user studies in this project.

    References

    1. Liu, W., Deng, X., Huang, G., Fu, A.Y.: An Anti-Phishing Strategy based on Visual Similarity Assessment. IEEE Internet Computing 10(2), 58–65 (2006)

    2. Fu, A.Y., Deng, X., Liu, W.: REGAP: A Tool for Unicode-based Web Identity Fraud Detection. Journal of Digital Forensic Practice 1(2), 83–97 (Special Edition on Anti-phishing and Online Fraud) (2006)

    3. Fu, A.Y., Deng, X., Liu, W., Little, G.: The Methodology and an Application to Fight against Unicode Attacks. In: Proceedings of SOUPS 2006, CMU, Pittsburgh, USA (July 2006)

    4. Gabrilovich, E., Gontmakher, A.: The Homograph Attack. Communications of the ACM 45(2), 128 (2002)

    5. ICANN, http://www.icann.org

    6. Unicode Consortium, The Unicode Character Code Charts By Script, http://www.unicode.org/charts

    7. BTPLC, Safe Web Colours for Colour-Deficient Vision, http://www.btplc.com/age_disability/technology/RandD/colours/colours1.htm

    8. Riemersma, T.: Colour Metric, http://www.compuphase.com/cmetric.htm

    9. Karp, R.: Reducibility among Combinatorial Problems. In: Proceedings of Symposium on the Complexity of Computer Computations (1972)

    10. Krammer, V.: Phishing Defense against IDN Address Spoofing Attacks. In: Proceedings of the 4th Annual Privacy Security Trust Conference 2006 (PST 2006), October 2006, pp. 275–284. ACM Press, New York (2006)

    11. The UK Payment Association, http://www.apacs.org.uk

    12. Computer Times, $8 Billion Lost to Online Scams, http://www.computertimes.com/oct06Articfle8BillionLostToOnlineScams.htm

    13. CityU Coloring Palette, http://antiphishing.cs.cityu.edu.hk/ColoringScheme

    14. Wu, M., Miller, R.C., Garfinkel, S.L.: Do Security Toolbars Actually Prevent Phishing Attacks? In: Proceedings of SIGCHI 2006, pp. 601–610 (2006), http://groups.csail.mit.edu/uid/projects/phishing/chi-security-toolbar.pdf

    15. Macambira, E.M.: An Application of Tabu Search Heuristic for the Maximum Edge-Weighted Subgraph Problem. Annals of Operations Research 117, 175–190 (2002)
