Characterizing web sessions:
- Characterizing Web user sessions [SIGMETRICS 1998]
-- *http://doi.acm.org/10.1145/362883.362920*
-- Abstract: This paper presents a detailed characterization of user sessions to the 1998 World Cup Web site. This study analyzes data that was collected from the World Cup site over a three month period. During this time the site received 1.35 billion requests from 2.8 million distinct clients. This study focuses on numerous user session characteristics, including distributions for the number of requests per session, number of pages requested per session, session length and inter-session times. This paper concludes with a discussion of how these characteristics can be utilized in improving Web server performance in terms of the end-user experience.
- Communication and Information: Alternative Uses of the Internet in Households
-- *http://doi.acm.org/10.1145/274644.274695*
-- Kraut et al., CHI 98
-- Abstract: The Internet has been characterized as a superhighway to information and as a high-tech extension of the home telephone. How are people really using the Internet? The history of previous technologies that support interpersonal communication suggests that communication may be a more important use and determinant of participants' commitment to the Internet than is information acquisition and entertainment. Operationalizing interpersonal communication as the use of electronic mail and information acquisition and entertainment as the use of the World Wide Web, we analyzed longitudinal data from a field trial of 229 individuals in 110 households during their first year on the Internet. The results show that interpersonal communication is a stronger driver of Internet use than are information and entertainment applications.
-- Collected: Session count (login/logout), Internet hours (logged in), email use (sessions and sent/received count), web use (domains or sites visited; volume measured by number of different domains accessed during the week, which was highly correlated with average weekly HTML pages [still true?])
-- Results:
--- Internet hours highly correlated with email and web sites
--- Email only moderately correlated with Web use
--- Preference for email over web considered preference for interpersonal communication rather than info/entertainment
--- Email ongoing, web satisfies a bounded goal
--- "In analyses not reported in this paper, we tested this idea by examining participants' loyalty to Email addresses and Web domains over time and found that people were two or three times more likely to reuse an Email address than they were to revisit a Web domain, even a year after its first use." [p. 373]
- Interactive path analysis of web site traffic [Conference on Knowledge Discovery in Data, 2001]
-- *http://doi.acm.org/10.1145/502512.502574*
-- Paths through a single web site
- Possible source material in work done on web personalization (based on navigation patterns), but these will tend to focus on intrasite interactions?

Possible measures:
- Loyalty (over different periods of time)
- Initiation:
-- Initiated from bookmarks
-- Initiated from index
-- Initiated from search
-- Initiated from external (manually typed, clicked from email)

Characterizing web content:

Crawling:
- 370K unique URLs, approx 90K accessed more than once
-- 370K would take about a week to crawl at 2 sec per URL (370K x 2 sec is about 8.6 days; at 2 URLs per sec it would be closer to 2 days)
- General process using wget
-# Retrieve with headers (-S) and directories (-x)
-#- Or use URL as name? Would probably overload filesystem. Need to consider this when retrieving headers
-#- Or put in db? Need to have a better understanding of total size
-# Truncate and compress
-# Save starting place in file

Text classification packages:
- See "text classification" on *http://www.freshmeat.net/*
- dbacl: *http://www.lbreyer.com/gpl.html*
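
Sketch for the wget process under Crawling (retrieve, truncate and compress, save starting place). This is only a rough outline under assumptions, not a worked-out crawler: the function names, the `fetch` callable, the 64 KB truncation limit, the 0.5 sec delay and the position-based file names are all placeholders that do not come from the notes. Only the `-S` and `-x` wget options are taken from the list above.

```python
import gzip
import os
import time
from typing import Callable, Iterable, List

MAX_BYTES = 64 * 1024  # assumed truncation limit; the notes do not give one


def wget_command(url: str) -> List[str]:
    """Build the wget call from the notes: -S prints server headers, -x forces directories."""
    return ["wget", "-S", "-x", url]


def read_checkpoint(path: str) -> int:
    """Return the index of the next URL to fetch, or 0 if no checkpoint file exists."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def crawl(urls: Iterable[str], fetch: Callable[[str], bytes], out_dir: str,
          checkpoint: str, max_bytes: int = MAX_BYTES, delay: float = 0.5) -> int:
    """Fetch URLs from the saved starting place; truncate and gzip each body.

    `fetch` is a hypothetical callable returning the response bytes for a URL.
    Returns the number of URLs fetched in this run.
    """
    os.makedirs(out_dir, exist_ok=True)
    start = read_checkpoint(checkpoint)
    done = 0
    for i, url in enumerate(urls):
        if i < start:
            continue  # already fetched in an earlier run
        body = fetch(url)[:max_bytes]  # truncate
        # Name files by position rather than by URL to keep file names short.
        with gzip.open(os.path.join(out_dir, f"{i:06d}.gz"), "wb") as f:
            f.write(body)  # compress
        # Save the starting place after every URL so an interrupted run can resume.
        with open(checkpoint, "w") as f:
            f.write(str(i + 1))
        done += 1
        time.sleep(delay)
    return done
```

In practice `fetch` might wrap a subprocess call to `wget_command(url)`, or the whole thing might write to a db instead of files; both options are left open here, as in the list above.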