Title: Automation and Customization of Rendered Web Pages
1 Automation and Customization of Rendered Web Pages
- Michael Bolin, Greg Little, Marcos Ojeda, Matt Webber, Philip Rha, Tom Wilson, Rob Miller
- MIT CSAIL
- http://uid.csail.mit.edu/chickenfoot
- Supported by NSF IIS-0447800
2 Web Applications
- The Web has become a major application platform
3 Automating Repetitive Operations
- Bookmark my latest bank statement
- Download many links at once
- Fill in defaults for forms
4 Transforming Appearance
- Change color scheme for better contrast
- Concatenate multiple pages
5 Integrating Multiple Web Sites
- Bookstore has links for New Books, Used Books, Auction, but not for my local library
- Realtor has lots of data about houses for sale, but not the length of my commute
6 Web Apps Are Wonderfully Open
- Web apps have automatic hooks for scripting
  - Display: machine-readable HTML
  - Commands: generic HTTP requests
  - Presentation: editable HTML, stylesheets
- Web screen scraping is already common, mainly behind the scenes (e.g., pricescan.com)
  - But most users don't do it
7 Problem: Many Web Apps Require A Browser
- Many web apps depend on the rich browser environment
  - Cookies, authentication, SSL, session IDs, plugins, user-agents, client-side scripting, proxies
- Perl/Python scripts run outside the browser, so they can't easily access these web apps
- Solution: do customization in the browser
  - Greasemonkey for Firefox
  - User JavaScript for Opera
8 Problem: Web Apps Are Scary Under the Hood
- HTML source of most sites is complex
- This complexity is a real barrier to automation and customization
9 Solution: Use Rendered View
- Chickenfoot user shouldn't have to look at HTML source to customize the Web
10 Outline
- Demo
- Language
  - Commands
  - Keyword patterns
- Implementation
  - Pattern matching algorithm
- Evaluation
11 Chickenfoot Language
- Chickenscratch = JavaScript + runtime library
- JavaScript syntax
- Standard browser objects
  - document.links
  - window.open()
- Document Object Model (DOM)
  - Node, Element, Text, Range
- Chickenfoot-specific objects and commands
12 Commands
- Page navigation
  - go(url), openTab(url)
  - fetch(url)
- Clicking and form manipulation
  - click(button-or-link), check(checkbox-or-radio)
  - enter(textbox, value), pick(listbox, choice)
- Pattern matching
  - find(pattern)
- Page modification
  - insert(pattern, html), replace(pattern, html)
  - remove(pattern)
- Widgets and input handling
  - new Link(html, action), onClick(pattern, action)
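A minimal sketch of how these commands might compose into a script. The commands go, enter, and click are the ones named on the slide; the stub bodies, the `log` array, the `searchFor` helper, the URL, and the "search textbox" pattern are assumptions added so the snippet runs in plain JavaScript outside Chickenfoot.

```javascript
// Hypothetical stand-ins for Chickenfoot commands. Inside Chickenfoot the
// runtime library would provide these; here they only record each call.
const log = [];
function go(url) { log.push(["go", url]); }
function enter(pattern, value) { log.push(["enter", pattern, value]); }
function click(pattern) { log.push(["click", pattern]); }

// The script itself: components are named by keyword patterns taken from
// the visible page text, not by HTML element names or XPath.
// The URL and patterns are illustrative assumptions.
function searchFor(query) {
  go("http://www.google.com");
  enter("search textbox", query);
  click("google search button");
}

searchFor("chickenfoot");
console.log(log);
```

In a real Chickenfoot session only the body of `searchFor` would be needed, since the commands are supplied by the environment.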
13 Keyword Patterns
- Keywords + component type
  - "feeling lucky button"
  - "depart textbox"
  - "search web form"
- Component type is optional for click(), enter(), check(), pick()
- Nested pattern matching
  - find("start address form").find("city textbox")
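The "keywords + component type" structure can be sketched as a small parser. This is an illustration only: `parsePattern` and the `COMPONENT_TYPES` set are hypothetical names, and the list of types is an assumption drawn from the component names that appear in the command signatures.

```javascript
// Component types assumed for illustration (not an official list).
const COMPONENT_TYPES = new Set([
  "button", "textbox", "form", "checkbox", "radio", "listbox", "link",
]);

// Split a keyword pattern into its keywords and an optional trailing
// component type. A single-word pattern is treated as a keyword only.
function parsePattern(pattern) {
  const words = pattern.trim().toLowerCase().split(/\s+/).filter(Boolean);
  const last = words[words.length - 1];
  if (words.length > 1 && COMPONENT_TYPES.has(last)) {
    return { keywords: words.slice(0, -1), type: last };
  }
  return { keywords: words, type: null };
}

console.log(parsePattern("feeling lucky button"));
console.log(parsePattern("depart"));
```

When the type is missing, a command such as enter() could supply it implicitly, which is one way to read "component type is optional" above.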
14 Keyword Patterns vs. Other Names
- Keyword
  - "all words textbox"
- JavaScript
  - document.f.as_q
- XPath
  - //body/form/table[1]/tbody/tr/td/table/tbody/tr[0]/td/table/tbody/tr/td[1]/table/tbody/tr[0]/td[1]/input
<td>with <b>all</b> of the words</font></td>
<td><input value="" name="as_q" size="25" type="text">
15 Pattern Matching Algorithm
- Find labels matching the keywords
- Find components matching each label
- Rank & choose best
[Diagram: the pattern "google search button" and the web page go into the Matcher, which outputs a ranked list of components scored 1.0, 0.5, 0.5]
16 1. Find Labels Matching Keywords
- Label = visible chunk of text
  - text nodes
  - button labels, listbox items
  - ALT attributes on images
- Tolerant matching
  - capitalization
  - word ordering
  - punctuation
  - typos
with <b>all</b> of the words
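A rough sketch of tolerant matching between keywords and a label's visible text. It ignores capitalization, punctuation, and word order, and forgives a single-character typo. The scoring rule (fraction of keywords found), the one-edit threshold, and the function names are assumptions for illustration, not the matcher Chickenfoot actually uses.

```javascript
// Lowercase, replace punctuation with spaces, and split into words.
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean);
}

// Levenshtein edit distance, computed row by row.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Fraction of keywords found in the label, in any order. Keywords of four
// or more letters may differ from a label word by one edit (assumed rule).
function labelScore(keywords, labelText) {
  const keys = tokenize(keywords);
  const labelWords = tokenize(labelText);
  if (keys.length === 0) return 0;
  const hits = keys.filter((k) =>
    labelWords.some((w) => w === k || (k.length >= 4 && editDistance(k, w) <= 1))
  );
  return hits.length / keys.length;
}

console.log(labelScore("Words, ALL", "with all of the words")); // 1
console.log(labelScore("Seach", "Google Search"));              // 1
```

The label text here is assumed to be already stripped of markup, so "with <b>all</b> of the words" would be passed in as "with all of the words".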
17 2. Find Component Matching Label
- Search in rendered view
- Component must be aligned with label
- Degree of match given by
  - pixel distance
  - relative position
  - HTML path length
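One way to picture the alignment test is with bounding boxes from the rendered page. The sketch below scores a label against a component using only pixel distance and relative position; HTML path length is left out. The box representation, the position weights, and the 200-pixel falloff are all assumptions made for illustration.

```javascript
// True when the intervals [a0, a1) and [b0, b1) overlap.
function overlaps(a0, a1, b0, b1) {
  return a0 < b1 && b0 < a1;
}

// Score in [0, 1] for how well a label box matches a component box.
// Boxes are {left, top, width, height} in pixels (assumed representation).
// Weights and falloff are illustrative, not Chickenfoot's actual values.
function alignmentScore(label, comp) {
  const lRight = label.left + label.width;
  const lBottom = label.top + label.height;
  const cRight = comp.left + comp.width;
  const cBottom = comp.top + comp.height;

  const sameRow = overlaps(label.top, lBottom, comp.top, cBottom);
  const sameCol = overlaps(label.left, lRight, comp.left, cRight);

  let gap, positionWeight;
  if (sameRow && lRight <= comp.left) {          // label to the left
    gap = comp.left - lRight; positionWeight = 1.0;
  } else if (sameCol && lBottom <= comp.top) {   // label directly above
    gap = comp.top - lBottom; positionWeight = 0.8;
  } else if (sameRow && cRight <= label.left) {  // label to the right
    gap = label.left - cRight; positionWeight = 0.6;
  } else {
    return 0;                                    // not aligned at all
  }
  return positionWeight / (1 + gap / 200);       // closer scores higher
}

const label = { left: 10, top: 100, width: 60, height: 20 };
const nearBox = { left: 80, top: 98, width: 150, height: 24 };
const farBox = { left: 400, top: 98, width: 150, height: 24 };
console.log(alignmentScore(label, nearBox) > alignmentScore(label, farBox)); // true
```

Labels drawn inside a component, such as button text, are not handled by this sketch and would need a separate case.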
18 3. Rank the Matching Components
- Rank score for each <label, component> pair is computed from
  - Match between keywords and label
  - Match between label and component
- Highest-ranked component is returned
- If there's a tie, find() returns the ambiguous matches, but click/enter/pick/check() throw an error
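The ranking and tie rule can be sketched as follows. Multiplying the two match scores is an assumption (the slide only says the rank is computed from both), and `rank`, `findBest`, and `requireUnique` are hypothetical names standing in for the behavior of find() versus click/enter/pick/check().

```javascript
// Each candidate is {component, keywordScore, alignmentScore} with scores in
// [0, 1]. Combining them by multiplication is an illustrative assumption.
function rank(candidates) {
  return candidates
    .map((c) => ({ component: c.component, score: c.keywordScore * c.alignmentScore }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score);
}

// find()-like behavior: return every component tied for the top score.
function findBest(candidates) {
  const ranked = rank(candidates);
  if (ranked.length === 0) return [];
  return ranked.filter((c) => c.score === ranked[0].score).map((c) => c.component);
}

// click()-like behavior: insist on exactly one best match.
function requireUnique(candidates) {
  const best = findBest(candidates);
  if (best.length !== 1) {
    throw new Error(best.length === 0 ? "no match" : "ambiguous match: " + best.join(", "));
  }
  return best[0];
}

const candidates = [
  { component: "searchButton", keywordScore: 1.0, alignmentScore: 1.0 },
  { component: "luckyButton", keywordScore: 0.5, alignmentScore: 1.0 },
  { component: "searchBox", keywordScore: 0.5, alignmentScore: 1.0 },
];
console.log(requireUnique(candidates)); // "searchButton"
```

Throwing on a tie keeps a script from silently acting on the wrong component, while the find()-like call leaves it to the caller to inspect the ambiguous matches.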
19 Evaluation
- Web-based survey of textbox naming
- 40 respondents (24 programmers, rest not)
- Comprehension: which textbox on the page is identified by this pattern?
- Generation: how would you identify this textbox uniquely using only words visible on the page?
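20 Results of Generation Task
[Chart: patterns for which the algorithm found the right match, a wrong match, or multiple matches. Residual chart values: 0, 26, 14; their mapping to the categories is not recoverable.]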
21 Disambiguation Strategies
- Keywords from section heading
  - "above person not available Mi"
- Counting
  - "second mi"
[Figure callout: same caption]
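The counting strategy could be sketched like this: a leading ordinal word selects among components that share the same caption. The `ORDINALS` table, the function names, and the component identifiers in the example are hypothetical; the matches are assumed to arrive in page order.

```javascript
// Ordinal words recognized by this sketch (an assumed, deliberately short list).
const ORDINALS = new Map([
  ["first", 0], ["second", 1], ["third", 2], ["fourth", 3], ["fifth", 4],
]);

// Split a leading ordinal word off a pattern, if there is one.
function splitOrdinal(pattern) {
  const words = pattern.trim().toLowerCase().split(/\s+/);
  if (words.length > 1 && ORDINALS.has(words[0])) {
    return { index: ORDINALS.get(words[0]), rest: words.slice(1).join(" ") };
  }
  return { index: null, rest: words.join(" ") };
}

// Given the ambiguous matches for the rest of the pattern, in page order,
// keep only the one the ordinal points at. Without an ordinal, the matches
// are returned unchanged and remain ambiguous.
function disambiguate(pattern, matchesInPageOrder) {
  const { index } = splitOrdinal(pattern);
  if (index === null) return matchesInPageOrder;
  return index < matchesInPageOrder.length ? [matchesInPageOrder[index]] : [];
}

console.log(splitOrdinal("second mi"));
console.log(disambiguate("second mi", ["applicantMI", "spouseMI"]));
```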
22 Future Work
- More component types for patterns
  - box, image, table
- Programming by demonstration
  - Pointing at page to generate patterns
  - Clicking & form filling to generate scripts
- JavaScript syntax extensions
23 Conclusion
- Chickenfoot automates and customizes web applications without looking under the hood
  - Simple language
  - Keyword patterns
  - Development environment in web browser
http://uid.csail.mit.edu/chickenfoot