
Understanding Mobile GUI: from Pixel-Words to Screen-Sentences

Jingwen Fu∗ (Xi'an Jiaotong University, Xi'an, China) [email protected]

Xiaoyi Zhang∗ (University of Electronic Science and Technology of China, Chengdu, China) [email protected]

Yuwang Wang† (Microsoft Research Asia, Beijing, China) [email protected]

Wenjun Zeng (Microsoft Research Asia, Beijing, China) [email protected]

Sam Yang (Microsoft, Redmond, USA) [email protected]

Grayson Hilliard (Microsoft, Redmond, USA) [email protected]

ABSTRACT
The ubiquity of mobile phones makes mobile GUI understanding an important task. Most previous works in this domain require human-created metadata of screens (e.g., View Hierarchy) during inference, which unfortunately is often not available or not reliable enough for GUI understanding. Inspired by the impressive success of Transformers in NLP tasks, and targeting purely vision-based GUI understanding, we extend the concepts of Words/Sentence to Pixel-Words/Screen-Sentence and propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS). In analogy to individual Words, we define Pixel-Words as atomic visual components (text and graphic components) that are visually consistent and semantically clear across screenshots of a large variety of design styles. The Pixel-Words extracted from a screenshot are aggregated into a Screen-Sentence by a proposed Screen Transformer that models their relations. Since Pixel-Words are defined as atomic visual components, the ambiguity between their visual appearance and semantics is dramatically reduced. We are able to make use of metadata available in training data to auto-generate high-quality annotations for Pixel-Words. A dataset of screenshots with Pixel-Words annotations, RICO-PW, is built on the public RICO dataset and will be released to help address the lack of high-quality training data in this area. We train a detector on this dataset to extract Pixel-Words from screenshots, achieving metadata-free GUI understanding during inference. Experiments show that Pixel-Words can be well extracted on RICO-PW and generalize well to a new dataset, P2S-UI, collected by ourselves. The effectiveness of PW2SS is further verified in GUI understanding tasks including relation prediction, clickability prediction, screen retrieval, and app type classification.

CCS CONCEPTS
• Human-centered computing → Graphical user interfaces; Mobile phones; • Computing methodologies → Image representations; Neural networks.

∗ Equal contributions during internship at Microsoft Research Asia.
† Corresponding author.

Figure 1: (a) Visualization of our proposed Pixel-Words (labeled with green boxes). (b) Compared to leaf nodes in VH (labeled with red boxes), Pixel-Words are cleaner and more visually consistent across screenshots.

KEYWORDS
GUI Understanding, Transformer, Detection

1 INTRODUCTION
As mobile phones have become indispensable in daily life, understanding GUIs has become an important capability for AI to accomplish tasks such as language navigation [16], task automation [15, 33], reverse software engineering [1, 23], screen reading [5], etc. The metadata of a screenshot, e.g., the View Hierarchy (VH), provides a tree-structured description of the UI components forming the screenshot. Most previous works [12, 15, 16] rely on this metadata to understand the GUI. Unfortunately, the metadata is often noisy [33-35] due to the large variety of platforms, third-party UI toolkits, coding styles, etc. [33].


What is worse, the metadata is often not accessible due to privacy or compliance issues. Understanding the GUI from the screenshot alone is a challenging and less-studied area. The first challenge is the complexity of screenshots. Different from natural images, screenshots consist of UI components from a large number of categories. UI components from different categories can be visually similar, and complex UI components may be composed of simpler ones, e.g., list views can be decomposed into icons and texts. The visual appearance of UI components varies due to the large variety of UI design styles. Chen et al. [6] try to detect non-text UI components from screenshots with traditional low-level vision algorithms, e.g., boundary extraction, but these lack semantic understanding and are not robust on screenshots with complex layouts. Zhang et al. [33] manually annotate the UI components of a self-collected private dataset and train a detector on them. It is difficult to extend such human labelling to larger-scale data, and there is a mismatch between human annotations and metadata for various kinds of UI components. Besides, both works are limited to detecting UI components from screenshots, without screen-level understanding. Another challenge is the lack of high-quality annotated large-scale datasets, which limits deep-learning-based methods. Liu et al. [19] aim to generate annotations of UI components by parsing the metadata with hand-crafted rules, but the quality is limited by the noisy metadata.

Our work is inspired by the modeling ideas in NLP. Each individual Word is an atomic component with essential semantics, which is modeled as a token. All the tokens in a Sentence are fed into a Transformer to understand the whole Sentence. The key is to achieve understanding of the whole sequence from the basic units. This successful modeling framework can be extended to understand UIs. In analogy to NLP, the "word" of a screen should be: 1) an atomic visual component carrying clear semantic meaning; 2) visually consistent across different UIs. A VH-based work [12] takes the leaf nodes in the metadata as the words of a screen. However, as shown in Figure 1 (b), those leaf nodes are very noisy: some nodes contain only icons or texts, while others contain both. The confusing semantics of the leaf nodes would hurt the understanding of both the individual components and the whole UI.

In this paper, we aim to achieve GUI understanding from pixels by extending the Word/Sentence concepts into Pixel-Word/Screen-Sentence concepts, and propose a new architecture: from Pixel-Words to Screen-Sentence (PW2SS). PW2SS removes the requirement of metadata at inference time and can be widely applied across different platforms, UI design tools and styles. The key is to design Pixel-Words that have the properties of Words in NLP. We define Pixel-Words as the atomic visual components of screens, which include Text Pixel-Words (text) extracted by OCR and Graphic Pixel-Words (icons and images) extracted by our Graphic Detector, as shown in Figure 1. The appearance of these Pixel-Words is clean and consistent across different screenshots. The benefits of our Pixel-Words definition are three-fold: 1) providing semantically effective tokens for the Transformer to understand Screen-Sentences; 2) enabling the OCR and Graphic Detector to extract Pixel-Words based on the consistent visual appearance of these components; 3) making it possible to clean the noise in the metadata and obtain pseudo-labels of Pixel-Words for training the detector.

According to the above definition of Pixel-Words, we propose a heuristic method to generate pseudo-labels for Pixel-Words and build a Pixel-Words dataset named RICO-PW based on the published RICO dataset [8]. For the Screen-Sentence, we leverage BERT [10] to design a Screen Transformer to model the relations of Pixel-Words, and train the Screen Transformer with masked Pixel-Words prediction.

To evaluate the effectiveness of our Pixel-Words and PW2SS, we conduct experiments on RICO-PW and a new GUI understanding dataset, P2S-UI, collected by ourselves. Our method achieves the best performance on the GUI understanding tasks including Pixel-Words extraction, relation prediction, clickability prediction, screen retrieval, and app type classification.

Our main contributions can be summarized as follows:

• We propose a pixel-based GUI understanding framework, PW2SS, which is suitable for general applications across different platforms, UI design tools and styles.

• By defining Pixel-Words as atomic text and graphic components, we make them visually consistent and semantically clear tokens for the Screen Transformer.

• We build a high-quality UI dataset, RICO-PW, with Pixel-Words annotations, which will be released to the public.

2 RELATED WORK
GUI understanding is a challenging area which has attracted increasing attention recently due to the prevalence and importance of smart devices. One important task is extracting elements from screenshots and learning representations of the components and the whole screen. Related works can be divided into two branches: purely vision-based methods, which only take the screenshot as input, and metadata-based methods, which require the metadata of the screenshot as an extra input. The first branch [6, 27, 33] tries to extract UI components from screenshots with detection techniques, but these methods are limited to component-level understanding and lack understanding of the relations between components and the whole screen. Besides, their detection ground truth is extracted from metadata directly, and the visual ambiguity of different UI components may hurt detection performance. The second branch achieves screen-level understanding with the help of metadata. Seq2act [16] retrieves elements in a screen using structured queries. ActionBert [12] and Screen2Vec [15] learn embeddings of elements and screens. However, these methods depend on the metadata, which is not universally accessible and often suffers from noise.
Transformer-based representation learning aims to model the relations of tokens with Transformers and provide a global understanding of the whole input. The Transformer is proposed in [28], and BERT [10] is a well-accepted pretraining method for Transformers. Recently, many works apply Transformers to vision-language tasks [13, 14, 21, 26]. Our work differs in that these works require extra text information, e.g., image captions, for each image, whereas our text information is extracted from the screenshot by OCR. The most related work to ours is LayoutLM [30, 31] for form understanding. The main difference is that graphic components like icons and images are more important in UI understanding, while LayoutLM mainly considers the text in scanned documents.
Mobile screen related applications focus on applying computer vision techniques to mobile UIs to solve particular tasks.


Figure 2: Overview of our method. Given a screenshot, we first generate Pixel-Words with an OCR module to extract Text Pixel-Words and a Graphic Detector and Graphic Classifier to extract Graphic Pixel-Words. We also use a Layout Encoder to obtain a Layout Embedding providing global layout information. All the Pixel-Words and the layout embedding are fed into the Screen Transformer to model the Screen-Sentence.

Due to the ubiquity of mobile phones, there are a lot of applications being explored. Reverse software engineering [1, 3, 22, 23] aims to generate the code of a UI from the corresponding screenshot. Design assistants [2, 4, 37] help to design the GUI of apps. Assistants for the visually impaired [5] aim to generate descriptions of UI elements. Previous works [7, 20] detect display issues in mobile screens with computer vision techniques. These papers mainly focus on specific application scenarios. Such tasks can be better accomplished with a good understanding of the screens, and can be regarded as downstream tasks of our work.
Mobile screen related datasets are very important for training GUI understanding models. For deep learning based methods, large-scale datasets are necessary for reliable performance and generalizability. RICO [8] is one of the most important public datasets in this area, with screenshots and corresponding metadata. However, it only contains simple auto-generated labels without semantics. Liu et al. [19] propose a way to generate semantic labels from the metadata, but the result is still noisy and not suitable for GUI understanding from pixels. In this paper, we generate high-quality annotations of Pixel-Words for GUI understanding tasks.

3 FROM PIXEL-WORDS TO SCREEN-SENTENCES

3.1 Overview of PW2SS
To understand the GUI from pixels, we extend the concepts of Word/Sentence in NLP to Pixel-Words/Screen-Sentences. Different from NLP, where the Words are already isolated, for GUI understanding we need to first extract Pixel-Words from the given screenshot and then feed them into our Screen Transformer, as shown in Figure 2. We first train the models to extract high-quality Pixel-Words and then train the Screen Transformer.

In analogy to Words, Pixel-Words should be isolated visual components that carry elementary semantics. Screenshots are typically rendered from metadata, e.g., the View Hierarchy (VH), describing the hierarchical structure of UI components in the screenshot (examples of VH data are shown in our Supplementary Material). One straightforward way to isolate the "words" of a screen is to use the leaf nodes in the VH as Pixel-Words. However, as shown in Figure 1, due to the variety of UI tools and coding styles, there are many invalid leaf nodes, and the content of leaf nodes varies across different screens. We therefore define Pixel-Words as the visual atomic components which make up the screen. As Figure 1 (a) shows, our Pixel-Words include text components (e.g., texts or blocks of text) and graphic components (e.g., icons and images), denoted as Text Pixel-Words and Graphic Pixel-Words respectively. These Pixel-Words are both visually consistent and semantically meaningful.

Based on the Pixel-Words, the Screen Transformer is able to understand the whole screen. Following BERT [10], we add position embeddings to represent the locations of the Pixel-Words and use the masked Pixel-Words prediction task to pretrain the Screen Transformer. Furthermore, we design a layout embedding to provide global layout information of the screen. The pretrained Screen Transformer can then be finetuned to accomplish various downstream tasks such as relation prediction, clickability prediction, screen retrieval and app type classification.

3.2 Pixel-Words Extraction and Understanding
We extract the Pixel-Words from a screenshot and turn them into tokens based on the different appearances of Text Pixel-Words and Graphic Pixel-Words, as shown in Figure 2. For Text Pixel-Words, we use an off-the-shelf OCR tool1 to extract the location and text content, then feed the text into Sentence-BERT [24] to obtain the token representations.

1https://developers.google.com/ml-kit/vision/text-recognition


Methods | Cleaning | Graphic Pixel-Word (Recall / Precision) | Text Pixel-Word (Recall / Precision)
Liu et al. | w/o | 0.73 / 0.71 | 0.69 / 0.62
Liu et al. | w/ | 0.72 / 0.75 | 0.69 / 0.69
Ours | w/o | 0.80 / 0.94 | 0.86 / 0.97
Ours | w/ | 0.85 / 0.95 | 0.95 / 0.97

Table 1: Comparison of the labels generated from the metadata using our method and using Liu et al. [19], evaluated on 200 screenshots labelled by us. "Cleaning" refers to the operation of removing screenshots with unreliable metadata (see Section 3.2).

For Graphic Pixel-Words, we design a Graphic Detector and a Graphic Classifier to extract the location and semantics. The Graphic Detector detects the graphic elements, then the classifier identifies the semantic meaning of each element, and the BERT embeddings of the semantic labels are used as the tokens.
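For illustration, the tokenization of Text Pixel-Words described above can be sketched in Python as follows. This is a minimal sketch: the OCR result format and the Sentence-BERT checkpoint name are assumptions, not the exact setup used in this paper.

# Sketch: turning OCR results into Text Pixel-Word tokens.
from sentence_transformers import SentenceTransformer  # Sentence-BERT [24]

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def text_pixel_words(ocr_results):
    # ocr_results: list of {"text": str, "bbox": [x_min, y_min, x_max, y_max]} (assumed format)
    tokens = []
    for item in ocr_results:
        embedding = text_encoder.encode(item["text"])  # semantic embedding of the text content
        tokens.append({"embedding": embedding, "bbox": item["bbox"], "type": "text"})
    return tokens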

To train the Graphic Detector with less human labeling cost, we make use of the VH in the RICO dataset [8, 19]. The biggest challenge in getting supervision for UI components is that it is hard to filter out invalid nodes using visual features. However, our clear definition of Graphic Pixel-Words shows great advantages here. For Graphic Pixel-Words, we only focus on the atomic graphic components in the UI, which have consistent appearance and can be easily distinguished from other, more complex UI components. The details are given in the Supplementary Material.

According to our definition of Pixel-Words, a baseline method is to leverage the component annotations provided by Liu et al. [19] on RICO. We manually choose the components labelled with text-related categories as our Text Pixel-Words, and the components labelled with icon- or image-related categories as our Graphic Pixel-Words. However, the Text Pixel-Words collected in this way are still very noisy due to the low quality of the annotations of Liu et al. [19]. To obtain a large amount of Pixel-Words supervision with a low cost of human effort, we carefully design a pipeline to extract annotations from the VH. For text, we first extract nodes in the VH with UI class names related to "text", then we use the OCR tool to localize text inside these nodes. For graphics, we generate candidates according to the locations of texts and nodes in the VH, then we use a binary classifier to identify whether each candidate is a graphic or not. We collect 1239 candidate patches from screenshots and manually label them as positive or negative samples. The binary classifier trained on this dataset demonstrates high accuracy. We further propose a "cleaning" operation to filter out invalid screenshots where there is an obvious mismatch between the number of text boxes extracted using OCR and from the metadata.
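The "cleaning" operation can be viewed as a simple consistency check between the OCR output and the metadata; a minimal sketch is given below, where the mismatch threshold is an assumption rather than the value used in the paper.

# Sketch of the "cleaning" filter: drop screenshots whose OCR output and VH metadata
# disagree too much on the number of text boxes. The threshold is an assumption.
def keep_screenshot(num_ocr_text_boxes, num_vh_text_nodes, max_mismatch=3):
    return abs(num_ocr_text_boxes - num_vh_text_nodes) <= max_mismatch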

To verify the effectiveness of the labels generated by our method, we manually label the Pixel-Words in 200 screenshots randomly sampled from RICO and use them as ground truth. As Table 1 shows, our generated labels for Pixel-Words achieve higher Recall and Precision than the leaf nodes of VH.

For further understanding of the extracted graphics, we use a Graphic Classifier to recognize the semantic meaning of each graphic, as shown in Figure 2. To train the Graphic Classifier, we build a high-quality graphic dataset, named RICO-ICON, by refining the category annotations from RICO.

Method | AR | AP | AP50 | AP75
Leaf Nodes | 0.665 | 0.575 | 0.720 | 0.611
Pixel-Words | 0.780 | 0.636 | 0.835 | 0.746

Table 2: Comparison of the detection results between leaf nodes and Pixel-Words on RICO-PW.

Models | Val AR | Val AP | Val AP50 | Val AP75 | Test AR | Test AP | Test AP50 | Test AP75
RTN-S | 0.720 | 0.613 | 0.753 | 0.658 | 0.720 | 0.609 | 0.748 | 0.654
ATSS-S | 0.716 | 0.587 | 0.782 | 0.629 | 0.718 | 0.587 | 0.778 | 0.628
FA-S | 0.756 | 0.656 | 0.756 | 0.705 | 0.750 | 0.645 | 0.749 | 0.693
FA-D | 0.790 | 0.678 | 0.841 | 0.725 | 0.792 | 0.676 | 0.837 | 0.722

Table 3: Comparison of different detectors' performance on the RICO-PW validation and test splits. AR is short for Average Recall. RTN-S, ATSS-S and FA-S denote RetinaNet, ATSS and FreeAnchor with ResNet50 [11] as the backbone respectively. FA-D denotes FreeAnchor with ResNeXt101 [29] as the backbone.

Based on the observation that there exists a serious long-tail problem among icon categories in the original RICO dataset, we only collect the most important categories in RICO-ICON. We use clicking frequency as a metric to measure the importance of graphic categories and select the 31 categories with the highest clicking frequency as the graphic categories in RICO-ICON. The remaining graphic categories are assigned an "other" label.
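A minimal sketch of this category selection is shown below; the assumed input is a stream of (category, clicked) records, which is an illustrative data format rather than the one used to build RICO-ICON.

# Sketch: keep the 31 most frequently clicked icon categories, map the rest to "other".
from collections import Counter

def build_category_mapper(icon_records, k=31):
    # icon_records: iterable of (category, clicked) pairs (assumed format).
    clicks = Counter(cat for cat, clicked in icon_records if clicked)
    keep = {cat for cat, _ in clicks.most_common(k)}
    return lambda cat: cat if cat in keep else "other"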

3.3 Screen-Sentence Understanding
To accomplish tasks requiring semantic understanding of the whole screen, we take the Pixel-Words from the previous stage as tokens and feed them into the Screen Transformer to model their relations and form a Screen-Sentence. We design our Screen Transformer as a 6-layer transformer architecture, referring to BERT [10]. As Figure 2 shows, the input of the transformer consists of three parts: Pixel-Word embeddings, corresponding 2-D position embeddings and a layout embedding. Pixel-Word embeddings represent the semantics of these atomic visual components. The embeddings of Text Pixel-Words and Graphic Pixel-Words are processed by a linear layer to ensure they have the same size. Position embeddings are added to each corresponding Pixel-Word to provide spatial position. Following the same setting as LayoutLM [30], we represent the position and size of a Pixel-Word in the screenshot as $(x_{min}, y_{min}, x_{max}, y_{max}, w, h)$, where $w$ and $h$ denote the width and height of the Pixel-Word respectively. A layout embedding is designed to provide the global layout information of the components. We follow [19] to generate the layout representation using the bounding boxes of our Pixel-Words.
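For concreteness, the combination of the three input parts could look like the following PyTorch sketch. The hidden size, the linear projection of the 6-dimensional box (instead of LayoutLM-style coordinate embedding tables) and the module names are simplifying assumptions, not the implementation used in the paper.

import torch
import torch.nn as nn

class ScreenTransformerInput(nn.Module):
    # Sketch: build the input sequence of the Screen Transformer from Pixel-Word
    # embeddings, 2-D position embeddings and a global layout embedding.
    def __init__(self, text_dim=384, graphic_dim=768, layout_dim=64, hidden=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)        # unify Text Pixel-Word size
        self.graphic_proj = nn.Linear(graphic_dim, hidden)  # unify Graphic Pixel-Word size
        self.pos_proj = nn.Linear(6, hidden)                # (x_min, y_min, x_max, y_max, w, h)
        self.layout_proj = nn.Linear(layout_dim, hidden)    # global layout embedding

    def forward(self, text_emb, graphic_emb, boxes, layout_emb):
        # text_emb: (B, Nt, text_dim), graphic_emb: (B, Ng, graphic_dim),
        # boxes: (B, Nt+Ng, 6) in the same order as the concatenated Pixel-Words,
        # layout_emb: (B, layout_dim)
        words = torch.cat([self.text_proj(text_emb), self.graphic_proj(graphic_emb)], dim=1)
        pos = self.pos_proj(boxes)                          # position embedding per Pixel-Word
        layout = self.layout_proj(layout_emb).unsqueeze(1)  # one extra global token
        return torch.cat([layout, words + pos], dim=1)      # sequence fed to the transformer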

The Screen Transformer is pretrained with a self-supervised task: masked Pixel-Words prediction. We select four downstream tasks covering understanding from the Pixel-Word level to the Screen-Sentence level. These tasks can be used to evaluate the understanding performance and are practically important.
Pretraining Task: Masked Pixel-Words Prediction. Inspired by BERT [10], where input tokens are randomly masked and predicted from the remaining tokens, we design the masked Pixel-Words prediction task to regress the embeddings of the masked Pixel-Words.


Figure 3: Visualization of leaf node ground truth (magenta), leaf node detection results (blue), Pixel-Words ground truth (red) and Pixel-Words detection results (green).

The purpose is to force the transformer to learn the dependencies among Pixel-Words in the screen. More specifically, we randomly mask 15% of the Pixel-Words fed into the Screen Transformer and take the objective function as the $\ell_2$ norm between the masked Pixel-Words and the predicted Pixel-Words.
Downstream task #1: Clickability Prediction. The task is to predict whether a Pixel-Word is clickable or not. A Pixel-Word is clickable when it is a button or part of a button. We use a classifier (a 3-layer MLP) to determine the clickability of a Pixel-Word from the embedding output by the Screen Transformer.
Downstream task #2: Relation Prediction. The relation prediction task is to predict the pairwise relation between Pixel-Word pairs. The relations reflect how semantically related two Pixel-Words are. For example, as shown in Figure 4 (b), the WiFi icon and the nearby text "WiFi" have the same semantic meaning and should have a close relation. We first sum the embeddings of the two Pixel-Words output by the Screen Transformer and feed the result into a 3-layer multi-layer perceptron (MLP) to predict the relation of the two Pixel-Words. The objective function is the cross entropy between the predicted relation and the ground truth.
Downstream task #3: Screen Retrieval. In analogy to sentence retrieval in NLP, the screen retrieval task aims to find the screenshots that are most similar to the query screenshot in terms of both the semantics of the content and the layout of the structure.
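The pretraining objective above can be sketched as follows; the use of a zero vector as the mask embedding and the mean-squared form of the $\ell_2$ loss are implementation assumptions.

import torch
import torch.nn.functional as F

def masked_pixel_word_loss(screen_transformer, pixel_word_inputs, mask_ratio=0.15):
    # pixel_word_inputs: (batch, num_words, hidden) Pixel-Word embeddings.
    b, n, d = pixel_word_inputs.shape
    mask = torch.rand(b, n, device=pixel_word_inputs.device) < mask_ratio  # ~15% of Pixel-Words
    mask_token = torch.zeros(d, device=pixel_word_inputs.device)           # assumed mask embedding
    masked_inputs = torch.where(mask.unsqueeze(-1), mask_token, pixel_word_inputs)
    predictions = screen_transformer(masked_inputs)                        # (batch, num_words, hidden)
    # l2 regression between predicted and original embeddings at the masked positions
    return F.mse_loss(predictions[mask], pixel_word_inputs[mask])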

Setting | IoU: AR | IoU: AP | IoU: AP50 | IoU: AP75 | Center: Recall | Center: AP
ImageNet pretraining | 0.693 | 0.601 | 0.899 | 0.676 | 0.963 | 0.922
RICO-PW pretraining | 0.714 | 0.634 | 0.916 | 0.741 | 0.961 | 0.938

Table 4: Detection results of Graphic Pixel-Words on P2S-UI with ImageNet or RICO-PW pretraining. IoU and Center metrics are used for evaluation.

To accomplish screen retrieval, a good understanding of the whole screen is required. To obtain a representation for a given screenshot, we apply max pooling over all the outputs of the Screen Transformer. Then we find the closest screenshots to the query screenshot using the cosine similarity of the screenshot representations.
Downstream task #4: App Classification. The app type classification task is to recognize the app type of each screenshot. We also use a max pooling operation to aggregate all the output embeddings of the Screen Transformer into a representation of the given screen. Then we feed the representation into a classifier (a 3-layer MLP) to predict the app type of the screen.
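A minimal sketch of the retrieval step described above (max pooling of the Screen Transformer outputs followed by cosine-similarity ranking); the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def screen_embedding(transformer_outputs):
    # transformer_outputs: (num_pixel_words, hidden) for one screenshot
    return transformer_outputs.max(dim=0).values            # max pooling over all outputs

def retrieve(query_embedding, gallery_embeddings, top_k=3):
    # gallery_embeddings: (num_screens, hidden); rank screens by cosine similarity to the query
    sims = F.cosine_similarity(query_embedding.unsqueeze(0), gallery_embeddings, dim=1)
    return torch.topk(sims, k=top_k).indices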


Figure 4: Visualization of relation prediction results of PW2SS on P2S-UI. A line connects two Pixel-Words that have a relation. Predicted relations are labeled with blue lines and ground truth with red lines.

Category | Top-1 acc | Category | Top-1 acc
add | 0.944 | favorite | 0.962
arrow_backward | 0.995 | filter | 0.880
arrow_downward | 0.981 | gallery | 0.930
arrow_forward | 0.977 | location | 0.940
arrow_upward | 0.876 | menu | 0.985
avatar | 0.904 | microphone | 0.945
calendar | 0.897 | more | 0.978
call | 0.930 | other | 0.929
camera | 0.900 | pause | 0.855
cart | 0.947 | play | 0.974
chat | 0.937 | question_mark | 0.968
check | 0.966 | refresh | 0.965
close | 0.951 | search | 0.941
delete | 0.896 | send | 0.941
download | 0.998 | settings | 0.905
edit | 0.942 | share | 0.990
Average | 0.957 | |

Table 5: Top-1 accuracy of our Graphic Classifier trained on RICO-ICON.

4 EXPERIMENT
4.1 Experimental Setup
Datasets. RICO [8, 19] is a public dataset of mobile UIs. There are 66,261 screenshots in total with corresponding metadata. The dataset was first proposed by Deka et al. [8] with screenshots and VH files only, and Liu et al. [19] added semantic labels for every node of the VH. RICO-PW is a dataset built by us for Pixel-Words detection based on the public RICO dataset (see Section 3.2). We take 67.5%, 7.5% and 25% of the screenshots for the graphic detector's training, validation and testing respectively. During pretraining, we combine the training set and test set to obtain a better pretrained detector. P2S-UI is a dataset collected by ourselves. It contains 27,077 mobile UI images, of which 5,556 are annotated with Pixel-Words manually. There are also annotations for relation, clickability and app type for each screenshot. RICO-ICON is an icon dataset consisting of 120,067 icons cropped from the RICO dataset. The icons are divided into 32 categories.
Baselines. Leaf nodes are used as the input of the Transformer in previous work [12], which can be regarded as the baseline for Pixel-Words. When extracting the leaf nodes, we use heuristic rules (e.g., spatial size and ratio of the nodes) to filter out invalid leaf nodes. W/o pretraining is the baseline where models are trained without the RICO-PW dataset, to study the impact of pretraining on RICO-PW. W/o Screen Transformer is the baseline to study the impact of our proposed Screen Transformer.
Metrics. To evaluate the performance of Pixel-Words detection, we use Intersection over Union (IoU) based detection metrics, including COCO-style [18] Average Recall (AR), Average Precision (AP), Average Precision at 0.50 IoU threshold (AP50), and Average Precision at 0.75 IoU threshold (AP75). We also use the "Center" metric, referring to [33]: a predicted bounding box is counted as correct if its center lies inside the ground-truth box. For graphic classification, clickability prediction and relation prediction tasks, we use top-1 accuracy as our metric.
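The "Center" criterion can be stated in a few lines; the box format (x_min, y_min, x_max, y_max) is assumed.

def center_hit(pred_box, gt_box):
    # "Center" criterion: a predicted box counts as correct if its center
    # lies inside the ground-truth box. Boxes are (x_min, y_min, x_max, y_max).
    cx = (pred_box[0] + pred_box[2]) / 2.0
    cy = (pred_box[1] + pred_box[3]) / 2.0
    return gt_box[0] <= cx <= gt_box[2] and gt_box[1] <= cy <= gt_box[3]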

4.2 Pixel-Words Evaluation
In this section, we evaluate the extraction of Pixel-Words from screenshots on RICO-PW and P2S-UI.
Study of Different Detection Models. We select several typical one-stage detectors (for efficient inference), equipped with FPN, to study the impact of different detectors on the RICO-PW dataset. RetinaNet [17] is a generic one-stage detector equipped with focal loss to handle the imbalance between positive and negative samples; it assigns labels based on a hand-crafted IoU threshold. ATSS [32] proposes a mechanism which adaptively adjusts the IoU threshold used to assign labels. FreeAnchor [36] treats label assignment as a maximum likelihood estimation problem and thereby achieves "learning to assign" during training. Table 3 shows the results of these detectors. FreeAnchor obtains the highest AR and AP scores, which means it performs well on both recognition and localization.


Figure 5: Visualization of the app type classification results on P2S-UI. The screenshots in the top/middle/bottom rows are classified as Shopping/Map/Life types respectively.

One possible reason is that, different from natural images, there are usually a lot of blank or texture-less background regions inside the target bounding boxes in screenshots. FreeAnchor can better handle this and find informative anchors to represent the detection target. Therefore, we use the FreeAnchor detector in the following experiments.
Pixel-Words vs. Leaf Nodes. We study which type of "word" of a screenshot (Pixel-Words or leaf nodes) is easier to extract. The Pixel-Words and leaf node detectors are trained with the supervision of the Pixel-Words annotations and the cleaned leaf nodes respectively on RICO-PW. Visual comparisons are shown in Figure 3. Compared to leaf nodes, as Table 2 shows, our Pixel-Words gain 17.29%, 10.60%, 15.87% and 22.09% on the metrics of AR, AP, AP50 and AP75 respectively. This result supports our assumption that Pixel-Words are easier to extract from pixels than leaf nodes and better fit our PW2SS framework.
Study of Pretraining on RICO-PW. We take our RICO-PW dataset as a pretraining dataset and finetune the detector on the human-labeled P2S-UI dataset. Here we take the common ImageNet [9] pretraining as the baseline. Table 4 shows the results. Compared to ImageNet pretraining, our RICO-PW pretraining gains 3.03%, 5.49%, 1.89% and 9.62% on AR, AP, AP50 and AP75 respectively. The biggest improvement comes from AP75, which means the pretraining helps to localize graphics more precisely.
Study of the Graphic Classifier. Graphics play an important role in GUIs, as they can convey rich semantic meaning in a few pixels. We recognize the graphics as fine-grained categories to represent their semantic meanings. As mentioned in Section 3.2, we reorganize the categories of the original RICO dataset and train our Graphic Classifier on the result. Here we use MobileNetV2 [25] as our Graphic Classifier. We report the category-level top-1 accuracy in Table 5. The results show that our Graphic Classifier can extract rich information from our dataset. We achieve 0.957 average top-1 accuracy over the 32 categories.

4.3 PW2SS Evaluation
We first pretrain the Screen Transformer with the masked Pixel-Words prediction task, then we evaluate the effectiveness of PW2SS on Pixel-Word level tasks (relation prediction and clickability prediction) and Screen-Sentence level tasks (screen retrieval and app type classification).
Pretraining on RICO-PW. Before training our Screen Transformer on downstream tasks, we first pretrain it on the RICO-PW dataset. The goal of pretraining is to force the Screen Transformer to model relations between Pixel-Words using the masked Pixel-Words prediction task. For each screenshot, 15% of the Pixel-Words are randomly masked. We pretrain the model for 50 epochs, using 5 epochs for warm-up. We use the AdamW optimizer with learning rate $10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$ and a batch size of 64 for training.
Clickability Prediction. In this task, we add a classifier head on the output features of the Screen Transformer to predict whether a Pixel-Word is clickable or not. Table 6 shows that with the Screen Transformer, we achieve better clickability results by using the context of Pixel-Words. Besides, pretraining on RICO-PW further improves the accuracy, which indicates that the masked Pixel-Words task is effective for learning a better representation by modeling the relations between Pixel-Words.
Relation Prediction. The relation prediction task is designed to evaluate the performance of our Screen Transformer on tasks about Pixel-Word pairs. Examples of relations are shown in Figure 4. As Table 6 shows, similar to the clickability prediction task, both the Screen Transformer and pretraining are effective for this task. An interesting observation is that, compared to the clickability prediction task, the gain brought by the Screen Transformer in the relation prediction task is larger. The reason is that relation prediction relies more on the context of Pixel-Words, where the Screen Transformer can provide more useful information.
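For reference, the pretraining optimizer settings listed above could be configured as in the sketch below; the linear warm-up schedule is an assumption beyond the stated 5 warm-up epochs.

import torch

def build_pretraining_optimizer(model, steps_per_epoch, epochs=50, warmup_epochs=5):
    # Settings from Section 4.3: AdamW, lr 1e-4, betas (0.9, 0.999), eps 1e-6, batch size 64.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6)
    # Linear warm-up over the first 5 epochs; the exact schedule is an assumption.
    warmup_steps = warmup_epochs * steps_per_epoch
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler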


Figure 6: Comparison of the retrieval results of our PW2SS and Liu et al. [19] on P2S-UI. The screenshots retrieved by PW2SS are more similar to the query screenshots in terms of both semantics and layout.

Setting | CP | RP
w/o Screen Transformer | 0.871 | 0.906
w/ Screen Transformer (w/o pretraining) | 0.896 | 0.947
w/ Screen Transformer (w/ pretraining) | 0.910 | 0.965

Table 6: Results of Clickability Prediction (CP) and Relation Prediction (RP) on P2S-UI. We study the impact of using the Screen Transformer and of pretraining on RICO-PW.

Setting | Model | Accuracy
Leaf Nodes | w/o Screen Transformer | 0.914
Leaf Nodes | w/ Screen Transformer | 0.915
Pixel-Words | w/o Screen Transformer | 0.948
Pixel-Words | w/ Screen Transformer | 0.955

Table 7: App type classification results on P2S-UI. We study the impact of different inputs to the Screen Transformer (Pixel-Words vs. leaf nodes) and the impact of the Screen Transformer.

App Type Classification. All the screenshots in the P2S-UI dataset are divided into 26 categories according to app type (e.g., SHOP, SOCIAL). We learn to predict the app type for each Screen-Sentence. As Table 7 shows, our Pixel-Words outperform leaf nodes by more than 3%, which verifies the effectiveness of our defined Pixel-Words in screen-level understanding. Qualitative results are shown in Figure 5.

Screen Retrieval. Screen retrieval is a practical task for evaluating a model's ability to learn the representation of the whole screen. Given a query screen, we retrieve the most similar screens. We compare our model to the method of Liu et al. [19], which uses an autoencoder of the screen layout to retrieve screens. As Figure 6 shows, the screen embedding generated by our Screen Transformer is more suitable for retrieving screens with similar semantic meaning as well as spatial layout.

5 CONCLUSION
In this work, we propose a pixel-based GUI understanding framework, PW2SS, which is suitable for general applications across different platforms, UI design tools and styles. We extend the Word/Sentence concepts into the Pixel-Word/Screen-Sentence concepts in the GUI understanding area. Pixel-Words are defined as atomic components with essential and clear semantics. Based on this definition, we can use OCR and a graphic detector to extract the Pixel-Words from screenshots. A Screen Transformer is then proposed to model the relations between Pixel-Words. The effectiveness of PW2SS is verified in tasks including Pixel-Words extraction, relation prediction, clickability prediction, screen retrieval, and app type classification. Our work can also be extended to understanding screenshot sequences of users accomplishing various tasks by aggregating Screen-Sentences into Task-Paragraphs.


REFERENCES
[1] Tony Beltramelli. 2018. pix2code: Generating code from a graphical user interface screenshot. In Proceedings of the ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS 2018).
[2] Sara Bunian, Kai Li, Chaima Jemmali, Casper Harteveld, Yun Fu, and Magy Seif El-Nasr. 2021. VINS: Visual Search for Mobile User Interface Design. (2021).
[3] Chunyang Chen, Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu. 2018. From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering (ICSE 2018).
[4] Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xin Xia, Liming Zhu, John Grundy, and Jinshui Wang. 2020. Wireframe-based UI design search through image autoencoder. ACM Transactions on Software Engineering and Methodology (TOSEM) 29, 3 (2020).
[5] Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu, Guoqiang Li, and Jinshui Wang. 2020. Unblind your apps: Predicting natural-language labels for mobile GUI components by deep learning. In 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE). IEEE.
[6] Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, and Guoqiang Li. 2020. Object detection for graphical user interface: old fashioned or deep learning or a combination?. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[7] Nathan Cooper, Carlos Bernal-Cárdenas, Oscar Chaparro, Kevin Moran, and Denys Poshyvanyk. 2021. It Takes Two to Tango: Combining Visual and Textual Information for Detecting Duplicate Video-Based Bug Reports. arXiv preprint arXiv:2101.09194 (2021).
[8] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology (UIST 2017).
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009).
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [n.d.]. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016).
[12] Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, and Jindong Chen. 2021. ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces. (2021).
[13] Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. CoRR abs/2004.00849 (2020). arXiv:2004.00849.
[14] Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. 2020. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2020), Vol. 34.
[15] Toby Jia-Jun Li, Lindsay Popowski, Tom M Mitchell, and Brad A Myers. 2021. Screen2Vec: Semantic Embedding of GUI Screens and GUI Components. arXiv preprint arXiv:2101.11103 (2021).
[16] Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping Natural Language Instructions to Mobile UI Action Sequences. In Annual Conference of the Association for Computational Linguistics (ACL 2020).
[17] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017).
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Springer.
[19] Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. 2018. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (UIST 2018).
[20] Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2020. Owl Eyes: Spotting UI Display Issues via Visual Understanding. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE.
[21] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. (2019).
[22] Kevin Patrick Moran, Carlos Bernal-Cárdenas, Michael Curcio, Richard Bonett, and Denys Poshyvanyk. 2018. Machine learning-based prototyping of graphical user interfaces for mobile apps. IEEE Transactions on Software Engineering (2018).
[23] Tuan Anh Nguyen and Christoph Csallner. 2015. Reverse engineering mobile application user interfaces with REMAUI (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE.
[24] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
[25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018).
[26] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of generic visual-linguistic representations. (2020).
[27] Xiaolei Sun, Tongyu Li, and Jianfeng Xu. 2020. UI Components Recognition System Based On Image Understanding. In 2020 IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C 2020). IEEE.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017).
[29] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017).
[30] Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[31] Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv preprint arXiv:2012.14740 (2020).
[32] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. 2020. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020).
[33] Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, et al. 2021. Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. In The 2021 ACM Conference on Human Factors in Computing Systems (CHI 2021).
[34] Xiaoyi Zhang, Anne Spencer Ross, Anat Caspi, James Fogarty, and Jacob O Wobbrock. 2017. Interaction proxies for runtime repair and enhancement of mobile application accessibility. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI 2017).
[35] Xiaoyi Zhang, Anne Spencer Ross, and James Fogarty. 2018. Robust annotation of mobile application interfaces in methods for accessibility repair and enhancement. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (UIST 2018).
[36] Xiaosong Zhang, Fang Wan, Chang Liu, Xiangyang Ji, and Qixiang Ye. 2021. Learning to match anchors for visual object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[37] Tianming Zhao, Chunyang Chen, Yuanning Liu, and Xiaodong Zhu. 2021. GUIGAN: Learning to Generate GUI Designs Using Generative Adversarial Networks. arXiv preprint arXiv:2101.09978 (2021).


Supplementary Material of Understanding Mobile GUI: from Pixel-Words to Screen-Sentences


{
  "text": "Buy Tickets",
  "resource-id": "org.trimet.mt.mobiletickets:id/trimet_toolbar_title",
  "adapter-view": false,
  "pointer": "401ed54",
  "scrollable-horizontal": false,
  "ancestors": ["android.widget.TextView", "android.view.View", "java.lang.Object"],
  "selected": false,
  "content-desc": [null],
  "rel-bounds": [70, 55, 915, 140],
  "draw": false,
  "focusable": true,
  "long-clickable": false,
  "visibility": "visible",
  "focused": false,
  "clickable": false,
  "abs-pos": true,
  "font-family": "sans-serif",
  "class": "android.support.v7.widget.AppCompatTextView",
  "visible-to-user": true,
  "package": "org.trimet.mt.mobiletickets",
  "enabled": true,
  "bounds": [322, 139, 1167, 224],
  "pressed": "not_pressed",
  "scrollable-vertical": false
}

Figure 1: A typical node in the View Hierarchy (VH) data from the RICO dataset.

Figure 2: Visualization of the process of generating missing graphics from the View Hierarchy (VH): detected text and its parent node are used to generate graphic proposals, which are then filtered by the binary classifier.

DETAILS OF PIXEL-WORDS ANNOTATION GENERATION
To solve the invalid node problem in the VH, motivated by "proposal-classification" [? ? ?] style detection methods, we extract proposals from the VH and then use a classifier to identify whether each proposal is a graphic or not.

To train such a binary classifier, we collect 1273 patches of VH nodes from the original screenshots of RICO. 694 of them are labeled as positive samples and the remaining 539 patches are labeled as negative samples. After training, our classifier achieves 0.95 accuracy.

However, there are two challenges here. First, there is a huge number of nodes in the VH; if we take every node as a proposal, it will be quite inefficient and increase the number of failure cases for our classifier. Second, about 28% of the graphics do not have corresponding nodes in the VH, which means the upper bound of recall would be very limited. To tackle the first challenge, we create a candidate set of node class names that covers all graphics-related categories. When we sample nodes from the VH, only nodes whose class names can be found in the candidate set are considered as possible proposals for graphics. For the second problem, we observed that the missing graphic nodes are often associated with texts, as Figure 1 shows. Therefore we generate new proposals according to the texts and their parent nodes. Specifically, for a text and its parent node, we select the regions between the text and its parent node's top/bottom/left/right boundaries as proposals, as shown in Figure 2. The methods addressing the first and second challenges help to improve precision and recall respectively.

def generate_graphic_annotations(nodes, clsname_candidates, text_bboxes, score_thres):
    # nodes: List[Dict], all node data in a VH file.
    # clsname_candidates: Set[str], possible graphic class names in the VH.
    # text_bboxes: List[List[int]], text bounding boxes in the current screenshot.
    proposals = []

    # Step 1: filter out unrelated nodes (check the node class and its ancestor classes).
    for node in nodes:
        if node["class"] in clsname_candidates or any(
                a in clsname_candidates for a in node["ancestors"]):
            proposals.append(node["bound"])

    # Step 2: add spaced regions between each text and its parent node.
    for text in text_bboxes:
        parent = get_parent_node_bbox(text)
        proposals += generate_spaced_region(parent, text)

    # Step 3: use the binary classifier to evaluate the proposals.
    graphics_bboxes = []
    for proposal in proposals:
        if evaluate(proposal) > score_thres:
            graphics_bboxes.append(proposal)
    return graphics_bboxes

Listing 1: Pseudo code for Graphic Pixel-Words annotation generation.
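The helper generate_spaced_region used in Listing 1 is not given in the paper; a plausible implementation of the four regions between a text box and its parent node's boundaries, with an assumed minimum-size filter, is sketched below.

def generate_spaced_region(parent, text, min_size=8):
    # Return candidate graphic regions between a text box and the boundaries of its
    # parent node. Boxes are (x_min, y_min, x_max, y_max); min_size is an assumed
    # filter that drops degenerate regions. This helper is hypothetical.
    px0, py0, px1, py1 = parent
    tx0, ty0, tx1, ty1 = text
    candidates = [
        (px0, py0, px1, ty0),  # region above the text
        (px0, ty1, px1, py1),  # region below the text
        (px0, py0, tx0, py1),  # region left of the text
        (tx1, py0, px1, py1),  # region right of the text
    ]
    return [(x0, y0, x1, y1) for x0, y0, x1, y1 in candidates
            if x1 - x0 >= min_size and y1 - y0 >= min_size]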
