AN INTEGRATED SOCIAL NETWORK AND CONTENT ANALYSIS WEB CRAWLER
We are developing a web crawler for analyzing virtual communities. It integrates social network and cultural theories from the social sciences into a software tool that can answer a range of research questions about the structure, culture and activities of these communities.
The internet's growing importance for social activity, be it market transactions, the distribution of cultural products or the organization of social movements, is widely recognized. However, despite the enormous amount of data that is electronically and publicly available, the social analysis of the internet is still in its infancy. We believe one reason is that social scientific theories and methods, such as social network analysis and content and sentiment analysis, have not yet been encapsulated in software that lets users easily answer practically or theoretically relevant questions about online social activity. Our mixed team of social and computer scientists aims to change this by developing a web crawler that combines the analysis of network structures and textual information on the web, with specialized integrated modules for forums and, in the future, blogs.
The kinds of questions that our program will be able to address include the following:
- How are websites dealing with a particular topic interlinked? How do they relate to other topics?
- Which are the most influential sites? Which characteristics do they share?
- Which major viewpoints are expressed in a virtual community? How are they distributed over various websites?
- Which expressed feelings are dominant in a specific forum?
- How are ethnic communities represented on the web? And what does that tell us about their real-world behavior?
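Questions about interlinking and influence can be explored once crawled hyperlinks are stored as a directed graph. The following is a minimal sketch, not the crawler's actual implementation: it assumes the crawl yields (source URL, target URL) pairs, collapses pages to hosts, and uses the number of distinct linking hosts as a simple stand-in for influence. The function names and the example.* hosts are hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def build_host_graph(links):
    """Turn (source_url, target_url) pairs into a host-level link graph.

    Returns a dict mapping each host to the set of other hosts it links to.
    Links within the same host are ignored.
    """
    graph = {}
    for src, dst in links:
        s, d = host_of(src), host_of(dst)
        if s and d and s != d:
            graph.setdefault(s, set()).add(d)
            graph.setdefault(d, set())  # make sure the target is a node too
    return graph


def rank_by_inlinks(graph):
    """Rank hosts by how many distinct hosts link to them (assumed proxy)."""
    indegree = {host: 0 for host in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    # Highest in-degree first; ties broken alphabetically for stable output.
    return sorted(indegree.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    crawled_links = [
        ("http://a.example/x", "http://b.example/"),
        ("http://c.example/", "http://b.example/y"),
        ("http://a.example/z", "http://c.example/"),
        ("http://a.example/", "http://a.example/about"),  # internal, ignored
    ]
    for host, inlinks in rank_by_inlinks(build_host_graph(crawled_links)):
        print(host, inlinks)
```

In-link counts are only one possible measure; centrality scores computed on the same graph would serve the same purpose.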
The crawler in action
A centralized and distributed virtual community compared:
The virtual community on the left is very densely connected and centralized, while the one on the right is distributed and decentralized. This visual impression is confirmed by the reported betweenness centrality measures of 0.711 and 0.253, respectively.
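A single score per community suggests a network-level measure. As a rough sketch, assuming the reported figures are betweenness centralization in Freeman's sense on an undirected link graph (this page does not state the exact formula), such a score could be computed as follows. Node betweenness is obtained with Brandes' algorithm; the function names and toy graphs are illustrative only.

```python
from collections import deque


def betweenness(adj):
    """Normalized betweenness per node for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours. Uses Brandes' algorithm.
    """
    raw = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                raw[w] += delta[w]
    n = len(adj)
    # Each undirected pair is counted twice, so (n-1)(n-2) scales to [0, 1].
    scale = (n - 1) * (n - 2)
    return {v: (b / scale if scale > 0 else 0.0) for v, b in raw.items()}


def betweenness_centralization(adj):
    """Freeman centralization: 1.0 for a star, 0.0 when all nodes are equal."""
    scores = betweenness(adj)
    n = len(scores)
    if n < 3:
        return 0.0
    top = max(scores.values())
    return sum(top - c for c in scores.values()) / (n - 1)


if __name__ == "__main__":
    star = {"hub": ["a", "b", "c", "d"],
            "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
    ring = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    print(betweenness_centralization(star))  # fully centralized
    print(betweenness_centralization(ring))  # fully decentralized
```

On this scale a value near 1 means that a single node brokers almost all shortest paths, while a value near 0 means that brokerage is spread evenly.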
The salience of emotional constructs in different forums:
The salience of different emotions is measured through dictionaries that are automatically applied to the texts found in forums. As a result, the salience of different expressed emotions, such as anxiety, sadness and anger, can be compared across different forums.
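The dictionary approach can be illustrated with a short sketch. It assumes that salience is the share of all words in a forum's posts that appear in an emotion's word list; the tiny dictionaries below are made up for illustration and are not the dictionaries used by the crawler.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real emotion dictionaries are far larger.
EMOTION_DICTIONARIES = {
    "anxiety": {"afraid", "nervous", "worried", "fear"},
    "sadness": {"sad", "cry", "grief", "lonely"},
    "anger": {"angry", "hate", "furious", "annoyed"},
}


def tokenize(text):
    """Split text into lower-case word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def emotion_salience(posts, dictionaries=EMOTION_DICTIONARIES):
    """Return, per emotion, the share of all tokens matching its dictionary."""
    counts = Counter()
    total = 0
    for post in posts:
        tokens = tokenize(post)
        total += len(tokens)
        for emotion, words in dictionaries.items():
            counts[emotion] += sum(1 for token in tokens if token in words)
    if total == 0:
        return {emotion: 0.0 for emotion in dictionaries}
    return {emotion: counts[emotion] / total for emotion in dictionaries}


if __name__ == "__main__":
    forum_a = ["I am worried and afraid", "so sad today"]
    forum_b = ["I hate this, I am furious"]
    print("forum A:", emotion_salience(forum_a))
    print("forum B:", emotion_salience(forum_b))
```

Dividing by the total word count makes forums of different sizes comparable. Other normalizations, such as the share of posts containing at least one matching word, would be equally plausible.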
Technical information and interface:
The software builds on the crawling and indexing of the open source crawler Nutch, combined with our customized data structures and site management. Its many functions and settings will be easily accessible through an intuitive interface with default options, while advanced users will still be able to change parameters.
- Presentation: Gives a more detailed description of the crawler and contains various screenshots.
- Poster: Briefly presents the most important facts about the crawler.
- AFOSR Grant: This new grant will fund further development of the web crawler beyond the duration of the CCPV project.
- ONR BAA 08-025 - Online Social Behaviors and Prediction of their Implications for the Physical World - A Comparative study: Building on the work of this project, this new grant will test whether data collected with the crawler can be used as input for models designed to predict real-world behavior.