«We tend to assume that digital [content] is forever. But anyone who accumulates enough information also knows that sometimes it's difficult to find, in other cases it breaks and, of course, there is a non-zero probability that things go wrong when it is hosted by third-party services. It is an old topic here, remember Will we have all this information in the future?. The topic is back in the news thanks to the article A Year After the Egyptian Revolution, 10% of Its Social Media Documentation Is Already Gone».
In the comments, Anónima said: «Given a time t and an interval Δt, the larger Δt, the more likely it is that the information from time t-Δt you want to find is gone». This sounded like a statement worth checking, so I decided to run an experiment with my del.icio.us bookmarks.
In delicious.com/rvr I have archived around 4000 links since 2004. So I downloaded the backup file, an HTML file with all the links and their metadata (date, title, tags). I wrote a Python script to process this file: it goes through the links and saves the current status of each one (whether the link is alive or not). With another script, the statuses were processed to generate the statistics. These are the results:
As can be seen, there is a correlation between the age of the links and the probability of their being dead. To reach the 10% loss cited for the Egyptian revolution, in the case of my Delicious account we must go back three years (to 2009). Going back six years, a quarter of the links are now defunct. Of course, the sample is very small and probably not representative. It would be interesting to compare it with other accounts and to extend the time span: How many links are still alive after 10 or 15 years? Is it the same with information stored in other media? Are all these dead links resting in peace on a forgotten disk of Google's cache?
I imagine that sometime in the future, librarians will begin to worry not only about digitizing documents from the remote past, but also about preserving those of the present.
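For readers who want to repeat the experiment, here is a minimal sketch of the kind of link check described above. It is not the original script: it assumes the backup is a Netscape-style bookmark file whose links carry an ADD_DATE attribute (a Unix timestamp), and it assumes that a link counts as alive when it answers with an HTTP status below 400. The names BookmarkParser, is_alive and dead_ratio_by_year are made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone
from html.parser import HTMLParser
import urllib.error
import urllib.request


class BookmarkParser(HTMLParser):
    """Collect (url, year saved) pairs from a Netscape-style bookmark export."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs and "add_date" in attrs:
            saved = datetime.fromtimestamp(int(attrs["add_date"]), tz=timezone.utc)
            self.links.append((attrs["href"], saved.year))


def is_alive(url, timeout=10):
    """Assumed rule: a link is alive if it answers with an HTTP status below 400."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def dead_ratio_by_year(html, check=is_alive):
    """Return {year: fraction of the links saved that year that are now dead}."""
    parser = BookmarkParser()
    parser.feed(html)
    total, dead = defaultdict(int), defaultdict(int)
    for url, year in parser.links:
        total[year] += 1
        if not check(url):
            dead[year] += 1
    return {year: dead[year] / total[year] for year in sorted(total)}
```

Passing the checker as a parameter makes it possible to try the statistics on a saved list of statuses without hitting the network again. Some servers reject HEAD requests, so a real run would probably need a fallback to GET.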
Brief analysis of 150,000 photographs from Flickr in the province of Malaga.
It identifies the profile and preferences of tourists.
Last Saturday, I was in Malaga. I was invited by Sonia Blanco and the Universidad Internacional de Andalucia to participate in a workshop on Tourism and Social Networks. Sonia is a professor at the University of Malaga, and one of the oldest bloggers in the Spanish blogosphere. Sonia asked me to present the analysis that Fernando Tricas and I did of Flickr photos and the Canary Islands (2009-2010), and I gladly accepted. I wanted to bring an update, so we got to work on a short presentation with data from the province of Malaga. And that's what is shown below.
Video
Last Thursday, with the presentation already made, Fernando passed me an interesting link, a visualization by the Wall Street Journal that shows the density of a week of Foursquare check-ins in New York. If the WSJ could do it, so could we ;) We already had the data and the map algorithms, so we generated the maps month by month and joined them to build the animation.
The video below shows the density of photographs taken in the province of Malaga from 2004 to 2010. Blue marks areas where few pictures were taken, and red marks areas where many were taken. The areas with many photographs are places of tourist interest. And of course, activity rises and falls with the months.
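The counting behind each frame of such an animation can be sketched as a grid of cells per month. This is only an illustration under stated assumptions, not the map code we used: the function name monthly_density, the (date, latitude, longitude) input format and the default cell size of 0.01 degrees are all hypothetical, and the step that colors cells from blue to red is left out.

```python
from collections import Counter, defaultdict


def monthly_density(photos, cell_deg=0.01):
    """Bin photos into a latitude/longitude grid, one grid per 'YYYY-MM' month.

    photos: iterable of (date_taken, latitude, longitude), where date_taken is
    an ISO date string such as '2008-08-15'.
    cell_deg: assumed cell size in degrees (a hypothetical default).
    """
    grids = defaultdict(Counter)
    for date_taken, lat, lon in photos:
        month = date_taken[:7]  # 'YYYY-MM'
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        grids[month][cell] += 1
    return grids
```

Each monthly grid can then be drawn as one frame, mapping low counts to blue and high counts to red.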
Data
The video is just a small part of the analysis we presented. The full version is available below.
As you may know, Flickr is a popular photo-sharing service with 5 billion hosted images and 86 million unique visitors. Flickr has social networking features, since it allows users to add contacts. Flickr can play a role in the promotion of tourist destinations, as it is one of the main sources of images on the Internet. But to us, Flickr is a huge source of data: Which are the most photogenic places? Who is taking pictures there? These and other questions can be answered using data mining.
For this study we obtained the metadata of 175,000 photographs (62,000 geolocated), 7,900 photographers and 1,470,000 tags (47,000 unique). All these pictures were either tagged "malaga" or had GPS coordinates inside the province of Malaga.
Analysis
Below are the five most relevant slides: the tag cloud, the number of photos and photographers by month, the top 10 countries of the geolocated photographers, the tag groups and the heatmaps of the geolocated images.
According to those who share photos on Flickr about Malaga, we can conclude that:
The high season in Málaga is August (also, in April there is a Holy Week effect).
Users come mainly from the UK, the USA, Italy, Germany, Madrid and Andalusia. (The USA is probably overrepresented compared to real visitors.)
They are interested in photography, beaches, festivals, fairs, nature, sea, birds, sky, parks.
Pictures are taken mainly in Málaga (capital), Ronda, Barcenilla and Benalmadena.
The full presentation slides show more features, such as geolocated photographs by countries. It is interesting to compare these data with the previous study on the Canaries. A more detailed analysis can be done, but the roundtable had limited time. This sneak peek shows the potential of social networking and geolocation services for market research. If you have any questions, ask in the comments!
Finally, my gratitude to the UNIA organizers for the invitation and hospitality, to Daniel Cerdan for suggesting the title of the post and to Fernando Tricas for his unconditional support.
The Cablegate set is composed of more than 250,000 diplomatic cables.
The total number sent by the Embassies and the Secretary of State is estimated.
One of the biggest mysteries in astrophysics is dark matter. Dark matter cannot be seen: it neither shines nor reflects light. But we infer its existence because dark matter has mass, and it modifies the paths of stars and galaxies. Cablegate has its own dark matter.
According to WikiLeaks, Cablegate is made up of 251,287 communications. But what is the real volume of cables between the Embassies and the Secretary of State? Can we guess it? The answer is yes, there is a simple way to find out. Using the methodology explained below, the total number of communications between the Embassies and the Secretary of State can be estimated.
These are the results.
The dark matter of the Embassies.
Between 2005 and 2009, more than 400,000 non-leaked cables are identified. In this case, the uncertainty is larger than with just one embassy due to the small number of released cables. The sum increased by 50% in just one week.
Curiously, the average size of the 1800 published cables is 12 KB. If this average is representative of the whole set, something I doubt, the total size of the 250,000 messages would be 350 MB.
Secretary of State.
In addition to the embassies' communications, Cablegate has some cables from the Secretary of State. These messages are often quite interesting, because they request information or send commands to the embassies (e.g. 09STATE106750).
No cables from 2005 and 2006 have been released, and therefore the sum cannot be estimated for those years. But between 2007 and 2009, the volume of cables sent by the Secretary of State is remarkable (so big that I wondered whether the record number was really an ordinal number rather than a more sophisticated identifier). Compare this graph with the one for the embassies. 2007 shows more cables from the Secretary than from all the Embassies combined, but beware, because this trend could be reversed with better data.
This is the chart for the Madrid Embassy, which ranks seventh in the number of leaked cables.
Between 2004 and 2009, the existence of at least 17,000 dispatches sent from Madrid can be deduced. In the same period, there are just 3500 leaked cables. The graph shows the breakdown by year. A high percentage of 2007 is leaked; the opposite holds for 2004 and 2005. Also, the number of communications decreases progressively (Why? Maybe other networks are used instead of SIPRNet). The complete table is available in Google Docs.
Cablegate Dark Matter Howto
The Guardian published a text file with the dates, sources and tags of the 250,000 diplomatic cables included in Cablegate. The content of these messages is being released slowly. (Using these short descriptions, I did an analysis of the messages related to Spain, tagged as SP, and suggested the existence of communications related to the 2004 Madrid bombings and the Spanish Internet law. Later, El País published these cables, confirming the suspicions.)
To infer the volume of communications, the methodology is quite simple. Each cable has an identifier. For example, 04MADRID893 summarizes the Madrid bombings of March 11th, 2004. This identifier can be broken into three parts:
04: Year the cable was sent (2004).
MADRID: Origin (the Embassy in Madrid).
893: Record number?
What's that record number? Let's investigate. There are some cables sent in December 2004 from the Madrid Embassy, such as 04MADRID4887 (dated December 29, 2004). Its record number is "4887". Another message, sent in February, has the ID 04MADRID527, record number "527". Looking at other cables dated in January, it seems obvious that the record number starts at 1 and goes up, one by one, through the year. The record number is a simple ordinal value. Thanks to this simple rule, and reading the last cables from the Madrid Embassy in December 2004, we know it sent ~4900 cables that year alone.
Ideally, the last cable of the year from each Embassy would be available, but the Cablegate data is not complete. Just a fraction of the leaked messages has been published so far, and the last cables of each year may not be part of Cablegate anyway. But, as the charts show, this method allows an approximation.
The code used for the calculations is available on GitHub (cablegate-sp) under a BSD license.
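The following is not the published code, just a minimal sketch of the rule described above: split each identifier into year, origin and record number, and keep the highest record number seen per origin and year as a lower bound for the cables sent. The regular expression assumes identifiers always look like two digits, uppercase letters, then digits, and the function name minimum_sent is hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: two-digit year, uppercase origin, ordinal record number.
CABLE_ID = re.compile(r"^(\d{2})([A-Z]+)(\d+)$")


def minimum_sent(cable_ids):
    """Return {(year, origin): highest record number seen}.

    Because record numbers count up from 1 through the year, the highest
    number seen is a lower bound for the cables that origin sent that year.
    """
    highest = defaultdict(int)
    for cable_id in cable_ids:
        match = CABLE_ID.match(cable_id)
        if match is None:
            continue  # skip identifiers that do not fit the assumed pattern
        year, origin, number = match.groups()
        key = (year, origin)
        highest[key] = max(highest[key], int(number))
    return dict(highest)
```

With the identifiers mentioned in this post, `minimum_sent(["04MADRID893", "04MADRID527", "04MADRID4887"])` gives 4887 for Madrid in 2004, which is where the ~4900 figure comes from.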
Out of sight, out of mind.
One month after the first cable release, only two thousand messages have been published. At this rate it will take a decade to release all the Cablegate content. Maybe not all messages are as relevant as those released so far (e.g. boring messages about visas). But if WikiLeaks has raised such a stir with just 2000 cables, I cannot imagine which other secrets remain in the thousands still unpublished (although top-secret cables use other networks).
Anyway, I'm sure there is still a lot of data mining work to do with the cables.
PS (December 30th, 2010): Ricardo Estalmán linked to this entry on Wikipedia about the German tank problem during World War II:
«Suppose one is an Allied intelligence analyst during World War II, and one has some serial numbers of captured German tanks. Further, assume that the tanks are numbered sequentially from 1 to N. How does one estimate the total number of tanks?»
The Cablegate case is quite similar. I will update the estimate with the formula cited in that article as soon as possible (Xmas days!).
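As a reference for that update, the usual frequentist estimate for the German tank problem is N ≈ m + m/k − 1, where m is the highest serial number observed and k is the number of serials observed. A minimal sketch, assuming the observed record numbers are a random sample from 1 to N (which leaked cables may well not be); the function name estimate_total is made up for this example:

```python
def estimate_total(serials):
    """German tank problem estimate: N ~ m + m/k - 1.

    m is the largest serial number observed and k the number of distinct
    serials observed. Assumes serials are drawn at random from 1..N.
    """
    distinct = set(serials)
    if not distinct:
        raise ValueError("at least one serial number is needed")
    m = max(distinct)
    k = len(distinct)
    return m + m / k - 1
```

For example, with the serials 19, 40, 42 and 60, the estimate is 60 + 60/4 − 1 = 74. Applied to cables, the serials would be the record numbers of one origin in one year.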
Social networks play an increasing role in the tourist market.
This post presents a study which analyzes 150,000 photographs of the Canary Islands on Flickr, a popular photo sharing site.
There is a boom in geolocated data and services.
This project began as research about tags, conceived by Fernando Tricas (University of Zaragoza) and me. Fernando came up with the idea of studying the photos taken during Expo Zaragoza 2008, but I live in the Canary Islands, so I preferred a closer region. Flickr is a photo sharing site with social network features, owned by Yahoo. The study presented here analyzed 150,000 photographs and 4,000 users, obtained from the Flickr tag search "canaryislands" (and its equivalents in other languages), taken between 2004 and 2008.
The first part of the study was presented in October 2009, where we analyzed photographer profiles and did a basic study of the tags added to images. This second part shows an interesting analysis of geolocated images to study tourist behavior and points of interest.
Geolocation
Flickr is able to store the longitude and latitude of the uploaded photographs. There is an increasing number of cameras with integrated GPS, especially mobile phones. Also, users can manually mark on a map the place where the photo was taken. From a total of 150,000 images with the tags "canarias", "canaryislands", etc., 36,000 images had geolocation data.
The first map of the presentation (slide 18) shows the raw geolocation of the 36,000 images: one point per image. Capitals and the main tourist places stand out, along with other points of photographic interest.
The second image (slide 19) shows the geolocation of the 36,000 images weighted by image popularity on Flickr (the point size is proportional to the number of views of the photograph). This gives an idea of the most photogenic places, according to image viewers. Many of our maps use this weighting by number of views.
In slide 20, geolocated photographs taken by foreign tourists are shown. During the first part of the study, we successfully geolocated ~25% of the photographers using their profile data. In this map, only the images taken by non-Spanish users are represented. Every point is an image and its size represents its popularity (number of views). This slide gives an idea of which places are most visited by foreign tourists.
For example, in Gran Canaria there aren't many photographs outside urban and tourist areas, with the exception of the center of the island. Compare this with the island of Tenerife, which shows many red coloured places besides Santa Cruz, Puerto de la Cruz and Los Cristianos. But it is the island of Lanzarote that stands out, both for the spread of its points of interest and for image popularity.
Flickr assigns each place a unique identifier, which can be resolved using Yahoo's APIs. In slide 21, the 36,000 images are grouped by place to show place popularity by number of photographs taken. Point size represents the number of photos in that place (not the number of views). Color represents the type of place (town, airport, province...). Remember that the maps generally show the whole set of 36,000 photographs, regardless of the country of the photographers, unless otherwise noted.
To display site popularity, slide 22 shows the aggregated number of views of each place, adding up the views of the images taken at each site. Capitals and the main tourist areas are the most popular places on Flickr, but there are other zones of high visibility.
Photographs can be beautiful and/or popular, regardless of where they were taken. To know where photographers go, slide 23 shows the number of photographers who took images in each place. Point size weights real visits, not image views. This slide displays all users, from Spain and abroad. Capitals stand out, but also tourist areas and other points of interest.
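The three per-place measures used in slides 21 to 23 (photos taken, total views and distinct photographers) can be sketched as a single aggregation. This is an illustrative sketch, not the code behind the slides: the function name place_stats and the (place_id, owner, views) input format are assumptions.

```python
from collections import defaultdict


def place_stats(photos):
    """Aggregate photos by place.

    photos: iterable of (place_id, owner, views) tuples (an assumed format).
    Returns {place_id: {"photos": n, "views": total, "photographers": distinct}}.
    """
    stats = defaultdict(lambda: {"photos": 0, "views": 0, "owners": set()})
    for place_id, owner, views in photos:
        entry = stats[place_id]
        entry["photos"] += 1        # slide 21: photographs taken in the place
        entry["views"] += views     # slide 22: aggregated views of the place
        entry["owners"].add(owner)  # slide 23: who actually went there
    return {
        place: {
            "photos": entry["photos"],
            "views": entry["views"],
            "photographers": len(entry["owners"]),
        }
        for place, entry in stats.items()
    }
```

Counting distinct owners rather than photos keeps one prolific photographer from making a place look crowded.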
Dates
In the next five slides, from 24 to 28, the photographs taken each year are shown, from 2004 to 2008. In the first part of the study, we showed a graph displaying the number of images uploaded to Flickr by date. These are the geolocated images. In these slides, 2004 shows few images compared to 2008. Each point is a photo and the circle size weights image popularity (number of views). These maps are helpful for identifying the site preferences of photographers.
Tourists
Slide 29 displays official statistics of tourist arrivals to the Canary Islands by source country between 2003 and 2007. The top sources are the United Kingdom, Germany, Holland and Ireland. In the first part of this study, the source countries of the successfully geolocated photographers were shown, and the stats were not exactly the same as the official ones. Slide 30 shows the source countries of geolocated photographers who also have geolocated images.
(Bear in mind that this set of 36,000 geolocated images comes from a search by tags like "canarias", "canaryislands" and the like, and not from a direct search by coordinates.) Most geolocated images were taken by photographers from Spain, the United Kingdom, Germany, Italy, the Dominican Republic and Holland.
From slide 31 to 38, geolocated images are shown by source country: United Kingdom, Germany, Italy, Holland, Ireland, USA, Belgium and France. In these maps, every point is a photo and its size weights its popularity (number of views). Clearly, images taken by foreign tourists are very popular on Flickr. There are also behavioral differences by country. There are tourists from the UK in Gomera, Tenerife, Gran Canaria, Fuerteventura and Lanzarote. In Gran Canaria and Tenerife, the British stay at the beaches and don't go outside tourist areas (rural tourism, anyone?)... but they love to go all over Lanzarote. German photographers also take photos in rural sites of Gran Canaria and other islands.
From slide 39 on, the number of photographs is shown by source autonomous community. There are 17 autonomous communities in Spain. The top communities by photographers are the Canary Islands, followed by Valencia, Galicia, Madrid, the Basque Country and Catalonia. These numbers are a bit different from those of the whole set (that is, including non-geolocated images), where the top ranks go to the Canary Islands, Madrid, Valencia and Catalonia. In the maps, each point is a photo and its size weights the number of views. There are maps whose points of interest are difficult to spot: those photos weren't too popular on Flickr.
And beyond
This method of studying users has great potential, especially when interpreted in a tourism context. However, bear in mind that the original set of images was found by searching by tags ("canarias", "canary islands") and that the analyzed users are part of a (quite popular) photo sharing site. These photographers do not necessarily represent the average tourist, and let's also remember that a good chunk of the photographs were taken by local residents. Anyway, taking this into account, this kind of study can be very helpful to the tourism market.
There is other data we gathered that isn't shown here, like the geolocation by hour taken. We also did a preliminary analysis relating tags and geolocation. It would be interesting to compare these results with those of other tag searches ("grancanaria", "tenerife"), to add geolocation searches to the set and, now that 2009 is over, to update the data...
The increasing popularity of check-in applications like Foursquare and Gowalla, and the recent moves by Twitter and Facebook in this market, prove there is a lot to be done. For many years, Google used its search technology to display ads based on content. The rise of social networks is shifting the paradigm towards displaying ads using user profiles. Foursquare and similar mobile apps add another layer: user profile and location. Privacy issues aside, smart companies can use this data to better understand their customers.
Thanks for reading this. Please use the contact data if you want to know more.
«First, the overwhelming trend is simply more CC-licensed images — an increase from 10 million to 135 million over four years — and we amusingly said 5 years ago that Flickr’s CC area had “gone way beyond our expectations” with 1.5 million licensed images».
A link in that post got my attention: their metrics wiki. In that wiki, world-wide Creative Commons adoption stats are available, last updated in November 2009. The stats are just indicative, because the methodology is quite simple and based on search engine results.
Creative Commons in the world
There are many interesting numbers. There are a total of 257 million results with a Creative Commons license, and the biggest share, 37%, carry the by-sa license (attribution, share alike). The most permissive license (by) accounts for just 10%, and the most restrictive (by-nc-nd) gets 20%.
I'm surprised by the high percentage of by-sa works. It could be an indication that authors actually care about the long term goals of Creative Commons.
As can be seen, Spain is the country with the highest number of Creative Commons licensed works, at least according to Yahoo statistics. In absolute numbers, Spain has approximately 10 million works. These rankings seem stable, at least in recent years.
If we rank the top 15 countries by adoption per capita, Spain ranks second with 0.223 works per inhabitant. Taiwan is first, with 0.229 per inhabitant.
In Spain's jurisdiction data page, the license with the highest percentage is the least restrictive, by, with 28%. Second is by-nc-sa. Indeed, by "license freedom", Spain ranks 8th.
«Nevertheless, besides the contributions of South American authors, there is reason to believe that the awareness of CC licenses in Spain itself is high. The CC launch event (October 2004) and the Copyfight event (July 2005) have likely increased awareness, but a recent (March 2006) and widely publicized court case regarding the streaming of royalty-free music from the Internet in bars has probably also contributed to a heightened awareness and sensitization to intellectual property issues in Spain. We therefore assume that Spain holds a special position among CC jurisdictions mainly because of two contributing factors: language and high license awareness and promotion. Moreover, Spain is among those countries with a relatively matured information society and developed economy which nevertheless exhibit relatively high piracy rates (and low-to-zero levels of litigation against piracy and file-sharing). This also places Spain in the group of countries where liberal licensing approaches may be benefiting from a general social attitude that is friendly towards sharing».
I don't think this analysis captures the right picture. Spanish Internet users have long been aware of copyright issues. I don't think that particular court case played a major role; what did were the legislative issues involving royalties applied to digital storage devices and the behavior of SGAE (a RIAA-like association) against "internet piracy". Under Spanish legislation, private copies of copyrighted works are allowed between individuals, and royalties are imposed on storage and copying devices (both analog and digital), which are collected by author and publisher associations. File sharing between individuals is legal, but the royalties are quite unpopular.
In my opinion, Creative Commons has been popular for the same reason libre software has been popular in Spain: a high percentage of Spain's internet users are strongly committed to a vision of an open and sharing culture. Almost all top bloggers promote open source software and free licenses. Of course, this vision is not shared by traditional authors and big media corporations. Political parties feel the pressure of lobbies, both from this country and from abroad.
So the question, for us, is how long will Spain remain the #1 country in the Creative Commons ranking?
PS: The original post in Spanish spread like wildfire among Spanish blogs and news sites.
As reported in the media, the number of active editors has stalled, or stabilized.
Proportionally, more women than men contribute!
Last February I was in Madrid and attended a conference dedicated to Wikipedia, held at Medialab Prado. Medialab is a DIY/open lab, which regularly organizes talks and workshops on cutting-edge topics.
That day, Miguel Vidal and Felipe Ortega (University Rey Juan Carlos) were invited. Miguel is one of the oldest Spanish Wikipedia editors. Miguel explained how Wikipedia is organized and how to contribute (PDF): there is life beyond the "edit" button!
Miguel's talk was great, but I was amazed by the second talk, maybe because it showed many graphics :) Felipe Ortega has been working with Wikipedia data for many years. In November 2009, the Wall Street Journal profiled his analysis in the article Volunteers Log Off as Wikipedia Ages.
«Volunteers have been departing the project that bills itself as "the free encyclopedia that anyone can edit" faster than new ones have been joining, and the net losses have accelerated over the past year. In the first three months of 2009, the English-language Wikipedia suffered a net loss of more than 49,000 editors, compared to a net loss of 4,900 during the same period a year earlier, according to Spanish researcher Felipe Ortega, who analyzed Wikipedia's data on the editing histories of its more than three million active contributors in 10 languages».
There are 3,203,546 registered editors in the 10 biggest Wikipedias.
Up to October 2009, there had been 296,387,800 edits in the English Wikipedia.
The mean time an editor remains in the project is 346.9 days.
Important issues are clearly illustrated in the presentation's graphics.
Slides 14 and 15 show that the number of active articles remains constant, in every language analyzed, even while Wikipedia's popularity was rising. The same goes for active discussions.
By size, some languages have two article "populations": short articles, of around 100 words, and large articles, with ~350 words. Short articles are most common in the English and Dutch Wikipedias. However, large articles are most common in the Spanish Wikipedia (see slide 16).
Discussions play a major role in the English Wikipedia: 80% of articles have discussions. Just the opposite of the Spanish Wikipedia, where only 20% have them.
These numbers show that every Wikipedia develops its own culture. Felipe also made some comments on the results about Wikipedia user profiles.
There is a high level of participation. 65% of users are readers, 25% are casual editors and 10% are regular editors.
A fascinating fact: women are more involved than men in Wikipedia: 82% of women are editors, compared to 41% of men.
There is a large total number of editors, but most edits are done by a small community of highly active editors: 90% of revisions are done by 5% of editors.
36% of users have a high school education. The second biggest group are college graduates, with 30%. 5% hold doctorates.
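A figure like "90% of revisions are done by 5% of editors" can be computed from the number of revisions each editor made. The following is a minimal sketch of that calculation, not Felipe's method; the function name top_share and its input format are assumptions.

```python
def top_share(revisions_per_editor, fraction=0.05):
    """Share of all revisions made by the most active `fraction` of editors.

    revisions_per_editor: iterable with one revision count per editor.
    """
    counts = sorted(revisions_per_editor, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one revision is needed")
    # Always take at least one editor, even for very small samples.
    top_n = max(1, int(len(counts) * fraction))
    return sum(counts[:top_n]) / total
```

For instance, with 20 editors where one made 171 revisions and the other 19 made one each, the top 5% (a single editor) account for 90% of the revisions.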
About content quality:
There is an increasing number of reverted edits.
Featured articles average 1,000 days to obtain the distinction.
Featured articles have between 10 and 200 times more editors than normal articles.
Felipe said experienced editors are an invaluable resource for Wikipedia and must be preserved, but it's a difficult task, as they tend to burn out, and in the long term they become inactive.
To achieve a higher level of user involvement, Felipe thinks Wikipedia should be promoted in high schools and colleges.
He mentioned usability studies being done to simplify editing and offer more information about the editorial workflow. In my opinion, Wikipedia needs a far better user experience. Making small edits and corrections is easy, but longer contributions have a steep learning curve. Some examples:
Everything's a wiki! But discussion pages would be better organized as comment threads, just like blogs. Users must learn how to add dates, times and names by hand, which doesn't make sense.
Template promotion. English Wikipedia articles are better structured than Spanish ones, and templates are more widely used. Still, template use should be encouraged through easier editing workflows (especially at creation time).
I'm sure Wikipedia numbers will benefit from every effort put in user experience.
If you want to read more, take a look at Felipe Ortega's site at URJC. Other graphics and tools are hosted at the WikiXRay project.
The number of references to Spanish political parties on news sites and social media is analyzed.
Graphics about references in mass media and social media are discussed.
According to raw numbers, Partido Popular shines above the rest.
Smaller parties yield the best results relative to their number of votes.
In a previous entry, I published the number of references to the major Spanish cities in digital and social media. In this post, the experiment is repeated, this time with Spanish political parties. Is the ruling party over-hyped? Or maybe minorities get too much news?
The methodology is quite simple: the party acronyms are searched in Google News in Spanish, and we get the number of references. So be careful, because different meanings of the same acronym are not filtered out. Nor are alternate searches with full party names run (e.g. just "PSOE" and not "Partido Socialista"). But in some cases, Google News returns associated terms (e.g. when "PP" is typed, Google News also returns results for "Partido Popular").
The first graphic shows the raw number of references in Google News for each party.
The X-axis shows the parties, sorted by decreasing number of votes (the ruling party is PSOE). The Y-axis shows the number of references. We can see the high number of references that Partido Popular has in comparison to PSOE, almost twice as many! PSOE doesn't get much better numbers if more complex searches are done (e.g. psoe OR partido socialista).
The second graphic shows the number of references relative to the number of voters.
The Y-axis shows the number of mentions per 1000 voters. Contrary to common sense, big parties are under-represented in the news, especially the ruling party: PSOE is the party which gets the least news per vote. At the top of the ranking, Partido Nacionalista Vasco, Izquierda Unida and Bloque Nacionalista Galego stand out.
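The normalization behind this second graphic is a simple ratio. A minimal sketch follows; the function name mentions_per_thousand_voters is hypothetical, and the real mention and vote counts live in the spreadsheet, not in this snippet.

```python
def mentions_per_thousand_voters(mentions, votes):
    """Return {party: mentions per 1000 voters}, highest ratio first.

    mentions and votes are dicts keyed by party acronym. Parties without a
    (non-zero) vote count are skipped.
    """
    ratios = {
        party: 1000 * count / votes[party]
        for party, count in mentions.items()
        if votes.get(party)
    }
    # Sort so the most over-represented parties come first.
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))
```

Dividing by votes is what lets a small party with few raw mentions rise above a large party with many more.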
Politics and newspapers
How well are parties treated by newspapers? These graphics show the number of references that each party gets in each digital newspaper.
The third graphic represents the percentage of coverage parties get on each media site. According to Google News, Partido Popular (a right-wing party) gets the biggest share of references, always above 30%, and its biggest percentage is in El País (44%). The ruling party, PSOE, is always second, except in the Catalan newspapers La Vanguardia and El Periódico, in which PSOE ranks third.
The fourth graphic shows the references to each party in the media relative to their number of voters in the 2008 elections.
In this case, the total number of references per 100,000 votes is displayed. As we already saw with cities, the best party coverage comes from the newspaper Abc. The data clearly shows that the biggest parties, PSOE and PP, are under-represented. Partido Nacionalista Vasco stands out, and has the highest number of references per voter in the main media sites (Spain). The data also shows that Abc pays special attention to Bloque Nacionalista Galego and Nafarroa Bai.
Parties and social media
How popular are political parties in blogs and social networks? Let's see. Warning: I'm not confident in Google's crawling of social networking sites. First, Google is still introducing "real time searches", which can still be considered an experimental feature. Second, Google can only crawl public content, and there is a huge amount of dark matter in Facebook (private content).
The fifth graphic shows the number of references to political parties in social media. For blogs, Google Blogsearch (filtered by language, "spanish") and Bitacoras.com were used. For Facebook and Twitter, Google was used, filtering results by domain (twitter.com and facebook.com).
Again, PP is the party with the highest number of references on every site, and Blogsearch shows an overwhelming landslide. PP is always above 40%, but in the case of blogs it gets 65% of total references: the difference between it and PSOE, the ruling party, is 55 points.
The sixth graphic shows the number of references in social media relative to the number of voters.
In this case, a percentage is displayed. Partido Nacionalista Vasco, Nafarroa Bai and Izquierda Unida are extensively covered in social media relative to their number of voters in the 2008 elections. The major parties, PP and PSOE, are under-represented, except in blogs. In the latter case, PP has 4 times more mentions than PSOE, similar to IU's numbers.
Reflections
During 2008 elections, I published in my Spanish blog some numbers of the parties' channels on YouTube (Los usuarios de YouTube votan, segunda parte y tercera parte). But comparing the stats with the election results, it was clear that there is no direct correlation between the number of references and the actual votes. i.e. Parties can lead to heated debates. The same rule can be applied to the aforementioned graphics.
And contrary to popular belief, small parties are benefited both in digital and social media.
Finally, another warning: please take this data with skepticism. The methodology is quite simple and Google searches could introduce artificial deviations. Raw data and graphics are available in this spreadsheet at Google Docs.
Please leave your comments and opinions below; they are welcome.
In the previous post, I posted some numbers on the media coverage of Spanish cities using Google News, which showed that Madrid and Barcelona do get the most attention but that, relative to population, other cities are over-covered. In this one, media references to the 50 most populated US cities are presented. The data was extracted today, 8th March 2010.
This is the first graphic, which shows the number of references on Google News for each of the top 50 US cities. New York, Portland and Washington are the most common, with more than 100,000 references each.
For a fair comparison, the number of references must be compared to the population of each city.
The second graphic shows the number of references per 100,000 inhabitants for the top 50 US cities. Surprisingly or not, Washington receives overwhelming media attention relative to its population, around 0.5 news articles per inhabitant. New York, the most populated city, only gets 0.03 articles per inhabitant. Clearly, Washington benefits from being the political center of the US. Other over-referenced cities, relative to their population, are Miami, Boston and El Paso.
Now, let's compare media coverage by news source.
The third graphic shows the number of city references found in Google News filtered by each of these three news sources: New York Times, Fox and CNN. The number of references is divided by the city population. I'm not sure whether it is a by-product of Google News indexing, but the New York Times is the source that covers the top 50 US cities the most.
Washington, Boston and New York are the preferred cities of the New York Times. Fox, however, prefers news from Washington, Cleveland and Boston. And finally, Washington, Atlanta and Miami are more likely to be cited by CNN.
Now let's see bloggers' bias.
US cities in the blogosphere
Google hosts Blogsearch, a search engine for blogs. The next graphic shows the number of posts referencing the top 50 US cities. Results are filtered by language (English).
The fourth graphic ranks cities by the raw number of references in the English-speaking blogosphere. New York, Washington and Chicago are the most cited US cities.
The next graph shows the number of references in Blogsearch per 1,000 inhabitants for the top 50 cities.
Washington, Boston and Miami have a high number of references relative to their population. For example, Blogsearch shows that Washington gets more than 200 posts per inhabitant! New York just gets around 20 posts.
A complaint I often hear is that Spanish media over-cover Madrid and Barcelona to the detriment of smaller cities. I was curious to know whether this was a fact, so I finally studied the number of references to the main Spanish cities in Google News.
This graph shows the 20 most populated cities of Spain and their number of references in Google News.
The X-axis orders the cities by population, so Madrid has the most inhabitants and Elche the fewest (of this top-20 selection). As we can see, Madrid and Barcelona get the most references, followed by Valencia, Vitoria, Seville and Zaragoza. This confirms the intuitive notion that Madrid and Barcelona get the most attention from the media, with more than 100,000 references each.
However, for a fair comparison, the number of references must be compared to the population of each city.
This graphic shows the number of references per 1,000 inhabitants for the top 20 cities of Spain. As can be seen, Madrid and Barcelona are around the average, while Palma de Mallorca, Las Palmas de Gran Canaria and Hospitalet de Llobregat are under-referenced relative to their population. Is it because these cities have long names? The references double if their shorter names are searched ("Las Palmas" and "Hospitalet"), but their relative positions remain low.
Of the top 20, Vitoria receives overwhelming attention relative to its population. The raw data show that Madrid has 3.2 million inhabitants and Google News counts ~200,000 news items. Vitoria has 235,000 inhabitants and counts ~55,000 news items.
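The per-capita comparison can be reproduced with a few lines of Python. This is only a minimal sketch using the approximate figures quoted above; the function name references_per_1000 and the dictionary layout are hypothetical and are not taken from the spreadsheet.

```python
# Approximate figures quoted in the post (population, Google News references).
cities = {
    "Madrid": {"population": 3_200_000, "references": 200_000},
    "Vitoria": {"population": 235_000, "references": 55_000},
}


def references_per_1000(references, population):
    """Return the number of references per 1,000 inhabitants."""
    return references / population * 1000


# Normalize every city by its population.
rates = {
    name: references_per_1000(data["references"], data["population"])
    for name, data in cities.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} references per 1,000 inhabitants")
```

With these rounded inputs, Madrid comes out at about 62 references per 1,000 inhabitants and Vitoria at about 234, that is, almost four times more relative coverage.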
León, Santander, Parla and Cádiz are among the most cited of the 62 most populated cities of Spain. For example, the city of Santander benefits from references to Banco de Santander, because Google News doesn't distinguish between the city and the bank.
The last graph compares the references to the 20 most populated cities in the top 3 Spanish newspapers: El País, El Mundo and Abc.es.
As can be seen, Abc.es is the newspaper that covers the Spanish cities the most, especially Valladolid, Alicante and Valencia. El País prefers Bilbao and Valencia, while El Mundo over-references Barcelona and Bilbao.
The tables with the complete numbers for the 62 cities are available in Google Docs. Wikipedia is the source of the city list and populations (it cites the National Statistics Institute of Spain, INE).
Addendum: Spanish cities in the blogosphere
Fernando Tricas suggested that I search for city references in Blogsearch, Google's blog search engine. Results are filtered by language (Spanish). The next graph shows the number of references to the top 20 Spanish cities in Spanish-language blogs.
As can be seen, Madrid, Barcelona and Valencia get the highest number of posts. Palma de Mallorca, Las Palmas de Gran Canaria and Hospitalet de Llobregat are also under-mentioned in the blogosphere.
The next graph shows the number of references in Blogsearch per 1,000 inhabitants for the top 20 cities.
Granada and Córdoba have a high number of references relative to their population. In contrast, Madrid and Barcelona receive little relative attention. These numbers are somewhat different from those of the media coverage. However, it is León and Cádiz that get the most relative attention when the top 62 are compared.
If you want to see the raw data, the results for the 62 cities and more graphics, take a look at this Google Docs spreadsheet.
Last Monday at ESCOEX (a business school) I attended a talk by Luis Suárez. Luis was invited by Néstor Domínguez, teacher and founder of the MOM-SOS marketing agency. Luis is an IBM employee who works in knowledge management, virtual communities and social tools.
He arrived in Maspalomas six years ago, and since then he has stayed connected with the rest of IBM's 500,000 employees all around the world from this tourist resort in Gran Canaria. Luis was born in León (Spain), loves the island's charm and often posts photos on his blog about social software (a wonderful job of tourism promotion!).
The class was quite interactive. Luis talked about his experiences as a telecommuter and social tools evangelist inside IBM. His fight against email as an unproductive tool stands out (his blog reflects this stance: Thinking outside the inbox). Below are the notes I took during the class.