Project
The city of Paris has released a dataset containing a list of all connections made in 2016
on the ~300 public wifi hotspots
, available on their open data website. The dataset is 1.7M rows
long, and features a number of interesting information for each session such as time, volume downloaded/uploaded, device, browser, OS and language. I have explored this dataset to see how much touristics Paris is, and find insights about most frequented places, top countries and languages, and preferred equipment.
Results
Preparing the data
Setup and libraries
We will several of the standard tidyverse libraries for data wrangling and plotting. For the maps, we will also be using the ggmap package.
library("tidyverse")
library("stringr")
library("lubridate")
library("ggmap")
library("knitr")
Load and format main file
wifi.raw <- read_delim(
"./data/wifi-data-2016.csv",
progress = FALSE,
delim = ";",
na = c("NULL", ""),
col_types = "ccnnnccccc"
)
glimpse(wifi.raw)
## Observations: 1,668,609
## Variables: 10
## $ start_time <chr> "2016-05-27T20:38:00+02:00", "2016-05-27T20:34:0...
## $ stop_time <chr> "2016-05-27T21:00:00+02:00", NA, "2016-05-27T21:...
## $ duration <dbl> 1075, 302, 2525, 1164, 2102, 1050, 1570, 4501, 9...
## $ input_octets <dbl> 3487928, 1586655, 175261111, 2443957, 4681103, 8...
## $ output_octets <dbl> 638911, 321442, 13506281, 144435, 583246, 251518...
## $ os <chr> "Mac OS X", "Android", "iOS", "iOS", "iOS", "And...
## $ browser <chr> "AppleMail", "Android", "Mobile Safari UIWebView...
## $ device <chr> "Other", "Samsung GT-S5369-ORANGE/S5369BVL", "iP...
## $ langue <chr> "fr_FR", "fr_FR", "fr_FR", "en_US", "fr_FR", NA,...
## $ site <chr> "Centre d'Animation Espace Beaujon", "Jardin Gar...
wifi <-
wifi.raw %>%
mutate(start_time = as_datetime(str_replace(start_time, "T", " ")),
stop_time = as_datetime(str_replace(stop_time, "T", " "))) %>%
mutate(input_mo = round(input_octets/1024^2, 2),
output_mo = round(output_octets/1024^2, 2),
language = str_match(langue, "([a-z]{2})?_?([A-Z]{2})?")[,2],
country = str_match(langue, "([a-z]{2})?_?([A-Z]{2})?")[,3]) %>%
select(-c(input_octets, output_octets)) %>%
arrange(start_time)
Add new features
In order to improve our analysis, we can compute and add new features based on the current information we have.
First, let’s simplify all OS into fewer, clean categories:
# Get OS Type
wifi <-
wifi %>%
mutate(os_type =
if_else(str_detect(os, "(Debian|Fedora|Mageia|MeeGo|Ubuntu|BSD)"),
"Linux",
if_else(str_detect(os, "(Windows|Mac OS|Linux|Chrome OS|Kindle|Black[Bb]erry|iOS|Android)"),
str_extract(os, "(Windows|Mac OS|Linux|Chrome OS|Kindle|Black[Bb]erry|iOS|Android)"),
"Other")
)
)
Then, let’s extract the device brand, based on the OS (for Apple), or the full device name:
# Get Device Brand
wifi <-
wifi %>%
mutate(device_brand =
if_else(os %in% c("iOS", "Mac OS X"), "Apple",
case_when(str_detect(device, "^L-EMENT") ~ "Logicom",
str_detect(device, "Kindle") ~ "Amazon",
TRUE ~ str_to_title(str_sub(str_extract(device, "^[:alpha:]{2,}[ -]"), end = -2))
)
)
)
Finally, let’s try to guess if the device is a Computer or a Mobile/Tablet:
# Get Type of Device
wifi <-
wifi %>%
mutate(device_type =
if_else(str_detect(browser, "Mobile"),
"Mobile/Tablet",
if_else(str_detect(os, "(Mobile|Phone|Bada|Symbian)"),
"Mobile/Tablet",
if_else(str_detect(os_type, "(Android|Blackberry|iOS)"),
"Mobile/Tablet",
if_else(str_detect(os_type, "(Windows|Linux|Mac OS)"),
"Computer", ""),
""),
""),
"")
)
wifi$device_type[wifi$device_type == ""] <- NA
Join with map of hotspots
As the location of the hotspots is in a different source, we need to join it with our main dataframe. However, even after having done some cleaning on the names of the sites, there were still inconsistencies due to incorrectly spelled site names. So I had to finish the match table manually in a spreadsheet, which I import below as a csv file:
# Load map of hotspots
wifi.map <- read_csv2("./data/wifi-hotspots-list.csv")
names(wifi.map) <- c("Id", "Nom", "Address", "CP", "City", "GeoPoint")
# Join, using semi-manually cleaned correspondance file
wifi.map <-
read_csv("./data/wifi-hotspots-list-clean.csv") %>%
left_join(wifi.map, by = c("Id" = "Id")) %>%
select(site_key = Site.1, site_clean = Site.Clean, site_id = Id, site_xy = GeoPoint) %>%
filter(!is.na(site_xy)) %>%
separate(site_xy, into = c("site_geo_y", "site_geo_x"),
sep = ", ", convert = TRUE)
# Create hotspot coordinate table, for later use
wifi.corresp <-
wifi.map %>%
select(site_id, site_clean, site_geo_x, site_geo_y) %>%
rename(site = site_clean) %>%
distinct()
# Join main dataframe with the clean hotspot mapping
wifi <-
wifi %>%
left_join(wifi.map, by = c("site" = "site_key")) %>%
select(-site) %>%
rename(site = site_clean)
Analysis
Now let’s jump into the fun part. We’ll try to answer five questions:
- What is the most popular hotspot ?
- Is French the most frequent language on devices ?
- When do people connect the most ?
- What is the proportion of mobile vs computers ?
- What is the most popular device brand ?
1. Most popular hotspots
Let’s take a first look at the most used hotspots, in terms of number of connections and output volume (in Mo).
graph.popular <-
wifi %>%
filter(!is.na(site_id)) %>%
group_by(site_id, site) %>%
summarise(connections = n(),
avg_volume = sum(output_mo) / connections) %>%
arrange(desc(connections)) %>%
left_join(wifi.corresp) %>%
ungroup()
graph.popular %>%
select(site, connections, avg_volume) %>%
head(10) %>%
kable(col.names = c("Site", "Connections", "Average Volume (Mo)"),
digits = 2, row.names = TRUE)
Site | Connections | Average Volume (Mo) | |
1 | Square Jean XXIII | 91691 | 3.85 |
2 | Centre d’Accueil Kellermann | 45496 | 7.04 |
3 | Berges de Seine Rive Gauche - Gros Caillou | 40797 | 2.88 |
4 | Jardin d’Eole 1 | 38778 | 5.72 |
5 | Bibliothèque Historique de la Ville de Paris | 33443 | 5.17 |
6 | Parc Champs de Mars | 32459 | 3.55 |
7 | Bibliothèque Yourcenar | 32134 | 4.49 |
8 | Square Louis XIII | 30718 | 2.60 |
9 | Médiathèque Marguerite Duras | 27730 | 4.16 |
10 | Parc Monceau 1 (Entrée) | 23743 | 3.60 |
The most popular hotspot by far is the square just behind the Notre-Dame cathedral, which is a very touristic area, so this result does not come as a surprise. Hotspots 3, 6 and 8 are also places with high tourist frequentation. Compared to the other hotspots, the average output volume per session is lower, which makes sense as the latter are probably more used for work.
We can plot a map of all hotspots and their number of connections. I customized the Google Map style using the dedicated tool.
# Define a map style and save it for later
googlemap.style <- "style=element:geometry%7Ccolor:0x242f3e&style=element:labels%7Cvisibility:off&style=feature:administrative.land_parcel%7Cvisibility:off&style=feature:administrative.neighborhood%7Cvisibility:off&style=feature:poi.park%7Celement:geometry%7Ccolor:0x263c3f&style=feature:road%7Celement:geometry%7Ccolor:0x38414e&style=feature:road%7Celement:geometry.stroke%7Ccolor:0x212a37&style=feature:road.highway%7Celement:geometry%7Ccolor:0x746855&style=feature:road.highway%7Celement:geometry.stroke%7Ccolor:0x1f2835&style=feature:transit%7Celement:geometry%7Ccolor:0x2f3948&style=feature:water%7Celement:geometry%7Ccolor:0x17263c"
googlemap.graph <-
get_googlemap(center = c(lon = 2.33, lat = 48.86), zoom = 12,
size = c(640, 500), scale = 2, maptype = "roadmap",
style = googlemap.style)
# Plot hotspots by volume
ggmap(googlemap.graph) + theme_void() +
geom_point(data = graph.popular, colour = "yellow",
aes(x = site_geo_x, y = site_geo_y, size = connections,
alpha = avg_volume))
2. Top languages
Let’s know look at the most popular languages based on the device information:
# Top 10 languages
graph.languages <-
wifi %>%
group_by(language) %>%
summarise(percent = n() / nrow(wifi) * 100) %>%
arrange(desc(percent)) %>%
filter(!is.na(language)) %>%
top_n(10)
ggplot(graph.languages, aes(reorder(language, percent), percent)) +
geom_col() +
geom_text(aes(label = paste(round(percent, 1), "%")),
nudge_y = 0.5, hjust = "left") +
coord_flip() +
theme(panel.grid = element_blank()) +
ylim(0, 50) +
labs(y = "Share %", x = "Device Language")
French represents less than half of the sessions on public wifi, followed by English. As UK and US citizens are the 2nd and 3rd among foreign visitors in Paris, this is consistent. Spanish and Germans are also within the top 5 nationalities visiting Paris.
We can now plot a map showing the most used language by hotspot. While most are French, a number of very touristic areas have more English devices. There is also one hotspot where the most frequent language is Arabic.
graph.language <-
wifi %>%
filter(!is.na(site_id)) %>%
group_by(site_id, language) %>%
summarise(connections = n()) %>%
ungroup() %>%
group_by(site_id) %>%
mutate(lang_pct = round(connections / sum(connections), 4),
total_count = sum(connections)) %>%
arrange(site_id, desc(connections)) %>%
filter(!is.na(language)) %>%
top_n(n = 1, wt = lang_pct) %>%
left_join(wifi.corresp)
ggmap(googlemap.graph) + theme_void() +
geom_point(data = graph.language,
aes(x = site_geo_x, y = site_geo_y, size = connections,
colour = language, alpha = lang_pct))
If we zoom in on locations where French is not the most frequent language, we can confirm our assumption about highly touristic areas:
graph.language %>%
ungroup() %>%
filter(language != "fr", total_count > 5000) %>%
arrange(desc(total_count)) %>%
transmute(site, language, lang_pct * 100, total_count) %>%
kable(col.names = c("Site", "Language #1", "Share %", "Connections"),
row.names = TRUE)
Site | Language #1 | Share % | Connections | |
1 | Square Jean XXIII | en | 39.12 | 91691 |
2 | Berges de Seine Rive Gauche - Gros Caillou | en | 29.69 | 40797 |
3 | Jardin d’Eole 1 | ar | 38.64 | 38778 |
4 | Parc Champs de Mars | en | 34.79 | 32459 |
5 | Square Louis XIII | en | 30.85 | 30718 |
6 | Square Rene Viviani | en | 42.92 | 19583 |
7 | Musee du Petit Palais | en | 37.09 | 18692 |
8 | Parvis de L’Hôtel de Ville (Seine) | en | 30.88 | 10203 |
9 | Jardin d’Eole 2 | en | 42.00 | 8276 |
10 | Berges de Seine Rive Gauche - Solferino | en | 37.69 | 6286 |
3. Connections per day
We can start by looking at patterns in connections through the whole year, with a grid plot by day.
graph.dayofyear <-
wifi %>%
mutate(day = day(start_time),
month = month(start_time)) %>%
group_by(month, day) %>%
summarise(connections = n())
ggplot(graph.dayofyear, aes(factor(day), reorder(month, -month))) +
geom_raster(aes(fill = connections)) +
scale_fill_distiller(palette = "Blues", direction = 1) +
labs(x = "Day of month", y = "Month")
It appears that there is a problem in the data: no connections have been recorded for the first 3 days of April through December, while there is are much more connections on 4-12 Janury to March. There is obviously a mixup in the dates, months and days have probably been inverted for the first 3 days of each month.
Given that the date-times are not reliable in this dataset, I will not go further in the analysis by day and hour.
4. Mobile vs Computer
On average, 67% of all connections are made on Mobile/Tablet. But where do people use mobiles/tablets and computers the most ?
graph.device_type <-
wifi %>%
filter(!is.na(site_id)) %>%
group_by(site_id, device_type) %>%
summarise(connections = n()) %>%
mutate(mobile_pct = connections / sum(connections)) %>%
filter(device_type == "Mobile/Tablet") %>%
left_join(wifi.corresp) %>%
ungroup()
table.device_type <-
graph.device_type %>%
select(site, connections, mobile_pct) %>%
mutate(mobile_pct = round(mobile_pct * 100, 1)) %>%
filter(connections > 5000) %>%
arrange(mobile_pct)
rbind(head(table.device_type, 5), c("...", "...", "..."), tail(table.device_type, 5)) %>%
kable(col.names = c("Site", "Connections", "Mobile Share %"))
Site | Connections | Mobile Share % |
Bibliothèque Historique de la Ville de Paris | 6978 | 20.9 |
Médiathèque Marguerite Duras | 9421 | 34 |
Bibliothèque Yourcenar | 12378 | 38.5 |
Maison Initiatives Etudiantes | 7760 | 40.2 |
Bibliothèque Trocadero | 5128 | 41.1 |
… | … | … |
Berges de Seine Rive Gauche - Solferino | 5889 | 93.7 |
Parc Champs de Mars | 30468 | 93.9 |
Square Louis XIII | 28964 | 94.3 |
Square Jean XXIII | 86610 | 94.5 |
Parvis de L’Hôtel de Ville (9_place) | 5309 | 94.6 |
Unsurprinsingly, libraries and other studying places have the highest computer usage, while touristic areas have almost exclusive mobile usage.
Let’s recap this on a map, with a color indicating if the hotspot is used more by Mobiles or Computers.
ggmap(googlemap.graph) +
theme_void() +
geom_point(data = graph.device_type,
aes(x = site_geo_x, y = site_geo_y, size = connections,
colour = mobile_pct), alpha = 0.5) +
scale_colour_distiller(type = "div", palette = 4)
5. Preferred brands
Note that about 27% of connections do not have a country indication from the device, and that we have identified only 67% of devices brands. However, this still leaves us with 1.1M connections to analyse.
wifi %>%
group_by(device_brand, device_type) %>%
summarise(percent = (n() / nrow(wifi)) * 100) %>%
arrange(desc(percent)) %>%
filter(!is.na(device_brand) & !is.na(device_type)) %>%
head(10) %>%
kable(digits = 1, col.names = c("Brand", "Device Type", "Share %"))
Brand | Device Type | Share % |
Apple | Mobile/Tablet | 28.7 |
Samsung | Mobile/Tablet | 17.3 |
Apple | Computer | 9.3 |
Lg | Mobile/Tablet | 1.2 |
Lumia | Mobile/Tablet | 1.1 |
Htc | Mobile/Tablet | 0.9 |
Generic | Mobile/Tablet | 0.8 |
Huawei | Mobile/Tablet | 0.8 |
Microsoft | Mobile/Tablet | 0.6 |
Asus | Mobile/Tablet | 0.5 |
Apple is clearly leading with 38% overall share of devices on mobile and computers, followed by Samsung. All other brands are below 2%.
Are there significant differences depending on the country ? We can look at the top brands on the 10 most frequent country, and see that Apple and Samsung are consistently the first 2 brands (except for Apple on Arab Emirates devices, for some reason that I couldn’t explain otherwise than data corruption), while the 3rd brand has higher variability but a much smaller share of market.
wifi %>%
group_by(country, device_brand) %>%
summarise(count = n()) %>%
filter(!is.na(device_brand), !is.na(country)) %>%
ungroup() %>%
group_by(country) %>%
mutate(country_total = sum(count),
percent = count / country_total * 100) %>%
arrange(desc(country_total), desc(percent)) %>%
top_n(3) %>%
mutate(rank = row_number(-percent),
device_brand = paste(device_brand, "-", round(percent, 1), "%")) %>%
head(30) %>%
select(country, country_total, device_brand, rank) %>%
spread(rank, device_brand) %>%
arrange(desc(country_total)) %>%
select(-country_total) %>%
kable(col.names = c("Country", "Brand #1", "Brand #2", "Brand #3"))
Country | Brand #1 | Brand #2 | Brand #3 |
FR | Apple - 59.4 % | Samsung - 25.7 % | Lumia - 1.9 % |
US | Apple - 74.9 % | Samsung - 15.1 % | Nexus - 1.5 % |
GB | Apple - 43.7 % | Samsung - 38.6 % | Lg - 2.4 % |
ES | Apple - 57.4 % | Samsung - 22.9 % | Aquaris - 4.3 % |
IT | Apple - 52.4 % | Samsung - 24.8 % | Lumia - 4.1 % |
DE | Apple - 55.9 % | Samsung - 26.9 % | Htc - 4 % |
BR | Apple - 67.2 % | Samsung - 22.9 % | Lg - 3.4 % |
CA | Apple - 71 % | Samsung - 17 % | Lg - 4.1 % |
CN | Apple - 79.3 % | Samsung - 6.9 % | Huawei - 4.3 % |
AE | Samsung - 99.3 % | Microsoft - 0.4 % | Lumia - 0.2 % |