Automation and Customization of Rendered Web Pages

1
Automation and Customization of Rendered Web Pages
  • Michael Bolin, Greg Little, Marcos Ojeda, Matt
    Webber, Philip Rha, Tom Wilson, Rob Miller (MIT
    CSAIL)
  • http://uid.csail.mit.edu/chickenfoot
  • Supported by NSF IIS-0447800

2
Web Applications
  • The Web has become a major application platform

3
Automating Repetitive Operations
  • Bookmark my latest bank statement
  • Download many links at once
  • Fill in defaults for forms

4
Transforming Appearance
  • Change color scheme for better contrast
  • Concatenate multiple pages

5
Integrating Multiple Web Sites
  • Bookstore has links for New Books, Used Books,
    Auction but not for my local library
  • Realtor has lots of data about houses for sale
    but not length of my commute

6
Web Apps Are Wonderfully Open
  • Web apps have automatic hooks for scripting
      • Display: machine-readable HTML
      • Commands: generic HTTP requests
      • Presentation: editable HTML, stylesheets
  • Web "screen scraping" is already common, mainly
    behind the scenes (e.g., pricescan.com)
  • But most users don't do it

7
Problem: Many Web Apps Require a Browser
  • Many web apps depend on the rich browser
    environment
      • Cookies, authentication, SSL, session IDs,
        plugins, user-agents, client-side scripting,
        proxies
  • Perl/Python scripts run outside the browser, so
    they can't easily access these web apps
  • Solution: do customization in the browser
      • Greasemonkey for Firefox
      • User Javascript for Opera

8
Problem: Web Apps Are Scary Under the Hood
  • HTML source of most sites is complex
  • This complexity is a real barrier to automation
    and customization

9
Solution: Use Rendered View
  • Chickenfoot user shouldn't have to look at HTML
    source to customize the Web

10
Outline
  • Demo
  • Language
      • Commands
      • Keyword patterns
  • Implementation
      • Pattern matching algorithm
  • Evaluation

11
Chickenfoot Language
  • Chickenscratch = Javascript + runtime library
  • Javascript syntax
  • Standard browser objects
      • document.links
      • window.open()
  • Document Object Model (DOM)
      • Node, Element, Text, Range
  • Chickenfoot-specific objects and commands

12
Commands
  • Page navigation
      • go(url), openTab(url)
      • fetch(url)
  • Clicking and form manipulation
      • click(button-or-link), check(checkbox-or-radio)
      • enter(textbox, value), pick(listbox, choice)
  • Pattern matching
      • find(pattern)
  • Page modification
      • insert(pattern, html), replace(pattern, html)
      • remove(pattern)
  • Widgets and input handling
      • new Link(html, action), onClick(pattern, action)
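The commands above can be combined into short scripts. The sketch below is only an illustration and is not taken from the talk: it assumes the command names and argument order listed on this slide, the URL is made up for the example, and makeRecorder is a hypothetical stand-in that logs each call so the script can run outside the browser. Inside Chickenfoot the commands would be built in.

```javascript
// Hypothetical stand-in for the Chickenfoot runtime: each command just
// records its name and arguments instead of driving a real browser.
function makeRecorder() {
  const log = [];
  const api = { log };
  for (const name of ["go", "enter", "click", "pick", "check"]) {
    api[name] = (...args) => { log.push([name, ...args]); };
  }
  return api;
}

// A script in the style of the slide: navigate, fill in a form, click.
// The URL is an assumption; the keyword patterns appear on later slides.
function searchScript(cf, query) {
  cf.go("http://www.google.com/advanced_search");
  cf.enter("all words textbox", query);  // enter(textbox, value)
  cf.click("google search button");      // click(button-or-link)
}

const cf = makeRecorder();
searchScript(cf, "chickenfoot");
console.log(cf.log);
```

The script names page components only by words visible in the rendered page, never by HTML element names or attributes.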

13
Keyword Patterns
  • Keywords + component type
  • Component type is optional for click(), enter(),
    check(), pick()
  • Nested pattern matching
      • find("start address form").find("city textbox")
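One way to read a keyword pattern is as a list of keywords followed by an optional component type. The sketch below is a guess at that split, not Chickenfoot's actual parser: parsePattern and COMPONENT_TYPES are hypothetical names, and the type list contains only the component types mentioned in this talk.

```javascript
// Component types named in this talk; the real list may differ.
const COMPONENT_TYPES = [
  "button", "link", "textbox", "checkbox", "radio", "listbox", "form",
];

// Split a keyword pattern into keywords plus an optional trailing
// component type. A pattern made of a single word is treated as a keyword.
function parsePattern(pattern) {
  const words = pattern.trim().toLowerCase().split(/\s+/).filter(Boolean);
  const last = words[words.length - 1];
  if (words.length > 1 && COMPONENT_TYPES.includes(last)) {
    return { keywords: words.slice(0, -1), type: last };
  }
  return { keywords: words, type: null };
}

// Example patterns from this slide, plus one without a component type.
console.log(parsePattern("feeling lucky button"));
console.log(parsePattern("depart textbox"));
console.log(parsePattern("search web form"));
console.log(parsePattern("city"));
```

With the type left out, as in the last call, the command itself would supply it: enter() implies a textbox and pick() a listbox.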

Example patterns: "feeling lucky button", "depart textbox", "search web form"

14
Keyword Patterns vs. Other Names
  • Keyword
      • "all words textbox"
  • Javascript
      • document.f.as_q
  • XPath
      • //body/form/table[1]/tbody/tr/td/table/tbody/tr[0]/td/table/tbody/tr/td[1]/table/tbody/tr[0]/td[1]/input

Page source for the same textbox:
<td>with <b>all</b> of the words</font></td>
<td><input value="" name="as_q" size="25" type="text">

15
Pattern Matching Algorithm
  • Find labels matching the keywords
  • Find components matching each label
  • Rank and choose best

[Figure: the pattern "google search button" and the web page go into the matcher, which outputs a ranked list of components with scores 1.0, 0.5, 0.5]

16
1. Find Labels Matching Keywords
  • Label = visible chunk of text
      • text nodes
      • button labels, listbox items
      • ALT attributes on images
  • Tolerant matching
      • capitalization
      • word ordering
      • punctuation
      • typos
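Tolerant matching of keywords against a label could look roughly like the sketch below. It is an assumed scoring scheme, not the algorithm from the talk: labelScore is a hypothetical name, the half credit for a keyword that is one edit away from a label word is an invented rule, and tokenizing handles ASCII text only.

```javascript
// Lowercase, replace punctuation with spaces, then split into words.
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").split(/\s+/).filter(Boolean);
}

// Classic Levenshtein edit distance between two strings.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const cur = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      cur[j] = Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Fraction of keywords found in the label, in any order.
// Exact word match counts 1; a keyword longer than three letters that is
// one edit away from a label word counts 0.5 (assumed typo rule).
function labelScore(keywords, labelText) {
  const labelWords = tokenize(labelText);
  const keys = tokenize(keywords);
  if (keys.length === 0) return 0;
  let total = 0;
  for (const k of keys) {
    let best = 0;
    for (const w of labelWords) {
      const d = editDistance(k, w);
      if (d === 0) best = 1;
      else if (d === 1 && k.length > 3) best = Math.max(best, 0.5);
    }
    total += best;
  }
  return total / keys.length;
}

const label = "with all of the words";
console.log(labelScore("all words", label));    // exact keywords
console.log(labelScore("Words, ALL!", label));  // case, order, punctuation ignored
console.log(labelScore("all wrds", label));     // one typo, partial credit
console.log(labelScore("exact phrase", label)); // no keyword matches
```

Splitting the label into words before comparing is what makes word order irrelevant; the edit-distance check is one common way to absorb typos.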

Example label (page source): with <b>all</b> of the words

17
2. Find Component Matching Label
  • Search in rendered view
  • Component must be aligned with label
  • Degree of match given by
      • pixel distance
      • relative position
      • HTML path length
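The three factors above could be combined into a single degree of match along the lines of the sketch below. Everything numeric here is an assumption made for illustration: the box representation, the position weights, and the two penalty scales are invented, and componentScore is a hypothetical name rather than part of Chickenfoot.

```javascript
// Boxes are {left, top, right, bottom} in page pixels (assumed representation).

// True when the intervals [a1, a2] and [b1, b2] overlap.
function overlaps(a1, a2, b1, b2) {
  return a1 < b2 && b1 < a2;
}

// Where the label sits relative to the component, or null if not aligned.
function relativePosition(label, comp) {
  const sameRow = overlaps(label.top, label.bottom, comp.top, comp.bottom);
  const sameCol = overlaps(label.left, label.right, comp.left, comp.right);
  if (sameRow && label.right <= comp.left) return "left";
  if (sameCol && label.bottom <= comp.top) return "above";
  if (sameRow && comp.right <= label.left) return "right";
  if (sameCol && comp.bottom <= label.top) return "below";
  return null;
}

// Pixel gap between the nearest edges of the two boxes.
function pixelDistance(label, comp) {
  const dx = Math.max(0, comp.left - label.right, label.left - comp.right);
  const dy = Math.max(0, comp.top - label.bottom, label.top - comp.bottom);
  return Math.hypot(dx, dy);
}

// Hypothetical weights: a label to the left of or above a component is favored.
const POSITION_WEIGHT = { left: 1.0, above: 0.8, right: 0.5, below: 0.3 };

// Degree of match; 0 means the component is not aligned with the label.
// pathLength is the number of DOM steps between the two nodes.
function componentScore(label, comp, pathLength) {
  const position = relativePosition(label, comp);
  if (position === null) return 0;
  const distancePenalty = 1 + pixelDistance(label, comp) / 100; // assumed scale
  const pathPenalty = 1 + pathLength / 10;                      // assumed scale
  return POSITION_WEIGHT[position] / (distancePenalty * pathPenalty);
}

const label = { left: 0, top: 0, right: 50, bottom: 20 };
const nearBox = { left: 60, top: 0, right: 200, bottom: 20 };
const farBox = { left: 400, top: 0, right: 540, bottom: 20 };
const offBox = { left: 60, top: 100, right: 200, bottom: 120 };

console.log(componentScore(label, nearBox, 2)); // highest of the three
console.log(componentScore(label, farBox, 2));  // lower: same row, farther away
console.log(componentScore(label, offBox, 2));  // 0: not aligned with the label
```

Working from rendered boxes instead of the HTML tree is what lets a label in one table cell match a textbox in a neighboring cell.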

18
3. Rank the Matching Components
  • Rank score for each <label, component> pair is
    computed from
      • Match between keywords and label
      • Match between label and component
  • Highest-ranked component is returned
  • If there's a tie, find() returns the ambiguous
    matches, but click/enter/pick/check() throw an
    error
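The ranking and tie rules above can be sketched as follows. This is an illustration under stated assumptions: multiplying the two match scores is a guess at how they are combined, the candidate entries are made up, and rank, findBest, and resolveOne are hypothetical names rather than Chickenfoot functions.

```javascript
// Combine the two match scores (by multiplication, an assumption)
// and sort candidates from best to worst.
function rank(candidates) {
  return candidates
    .map((c) => ({ ...c, score: c.labelScore * c.componentScore }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score);
}

// find()-style lookup: return every candidate tied for the top score.
// Exact equality on scores is a simplification for this sketch.
function findBest(candidates) {
  const ranked = rank(candidates);
  if (ranked.length === 0) return [];
  return ranked.filter((c) => c.score === ranked[0].score);
}

// click()-style lookup: an action needs exactly one target,
// so no match or a tie is an error.
function resolveOne(candidates) {
  const best = findBest(candidates);
  if (best.length === 0) throw new Error("no match");
  if (best.length > 1) {
    throw new Error("ambiguous match: " + best.map((c) => c.id).join(", "));
  }
  return best[0];
}

// Made-up candidates for the pattern "google search button".
const candidates = [
  { id: "Google Search button", labelScore: 1.0, componentScore: 1.0 },
  { id: "search textbox", labelScore: 0.5, componentScore: 1.0 },
  { id: "Search link", labelScore: 1.0, componentScore: 0.5 },
];

console.log(resolveOne(candidates).id); // Google Search button

// Without the clear winner, the remaining two tie at 0.5.
const tied = candidates.slice(1);
console.log(findBest(tied).map((c) => c.id)); // both tied candidates
try {
  resolveOne(tied);
} catch (e) {
  console.log(e.message); // reports the ambiguous match
}
```

Returning ties from a lookup but refusing to act on them keeps queries forgiving while preventing a click on the wrong component.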

19
Evaluation
  • Web-based survey of textbox naming
  • 40 respondents (24 programmers, rest not)
  • Comprehension: which textbox on the page is
    identified by this pattern?
  • Generation: how would you identify this textbox
    uniquely using only words visible on the page?

20
Results of Generation Task
[Chart: patterns for which the algorithm found the right match, a wrong match, or multiple matches; extracted values: 0, 26, 14]

21
Disambiguation Strategies
  • Keywords from section heading
      • "above person not available Mi"
  • Counting
      • "second mi"

[Figure: textboxes with the same caption]

22
Future Work
  • More component types for patterns
      • box, image, table
  • Programming by demonstration
      • Pointing at page to generate patterns
      • Clicking and form filling to generate scripts
  • Javascript syntax extensions

23
Conclusion
  • Chickenfoot automates and customizes web
    applications without looking under the hood
  • Simple language
  • Keyword patterns
  • Development environment in web browser

http://uid.csail.mit.edu/chickenfoot