Abstract
The digital economy is a highly relevant item on the European Union's policy agenda. Cross-border internet purchases are part of the digital economy, but their total value currently cannot be accurately measured or estimated. Traditional approaches based on consumer surveys or business surveys are shown to be inadequate for this purpose, due to language bias and sampling issues, respectively. We address both problems by proposing a novel approach based on supply-side data, namely tax returns. The proposed data-driven record-linkage techniques and machine learning algorithms utilize two additional open data sources: European business registers and internet data. Our main finding is that the value of total cross-border internet purchases within the European Union by Dutch consumers was over EUR 1.3 billion in 2016. This is more than 6 times as high as current estimates. Our finding motivates the implementation of the proposed methodology in other EU member states. Ultimately, it could lead to more accurate estimates of cross-border internet purchases within the entire European Union.

Footnote: The views expressed in this paper are those of the authors and do not necessarily reflect the policy of Statistics Netherlands.
Keywords: cross-border e-commerce, data-driven record linkage, digital economy, machine learning, official descriptive statistics, online consumption
1 Introduction
Accurate estimates of cross-border online consumption within the European Union (EU) are difficult to obtain. Recently, accurately estimating cross-border online consumption has become more important for two reasons. The first reason is that consumption of both goods and services through online channels is increasing within the EU, especially across borders. Consumers have more access to the internet, shipping costs are decreasing and payment services converge across countries [Marcus and Petropoulos2016, Cardona and Duch-Brown2016, Martikainen et al.2015]. Accurate estimates of cross-border online consumption are thus of increasing importance for adequately reporting on national accounts by national statistical institutes. The second reason is that cross-border online trade is nowadays a highly relevant item on the EU's policy agenda when considering the EU Digital Single Market (European Commission, COM/2010/0245). Therefore, getting a grip on cross-border online consumption through reliable estimates is essential for quantifying the effect of any new policy. The need for accurate estimates on aspects of the digital economy within the European Union is emphasized by the European Commission's monitoring framework for the digital economy [eu2015monitoring].
Traditional methods for estimating consumption are based on either consumer surveys or business surveys. One EU-wide consumer survey on cross-border online consumption is conducted by Ecommerce Europe (https://www.ecommerce-europe.eu). In the Netherlands, this survey is conducted by market research institute GfK (https://www.gfk.com). It is commissioned by the national e-commerce association in the Netherlands, Thuiswinkel.org (https://www.thuiswinkel.org), on behalf of Ecommerce Europe. The estimates of total cross-border online consumption are based on asking consumers how much they spent at foreign webshops over a fixed time period in the past. However, webshops selling goods or services typically operate in a country using a website in the regional language [Schu and Morschett2017]. This observation is consistent with the finding that the main impediment to online consumption is foreign language, rather than security concerns, shipping costs, geographical distance or available payment services [Gomez-Herrera et al.2014]. Therefore, it is difficult for a consumer to distinguish between domestic and foreign webshops, as both will be presented in the same, regional language. A consumer-survey approach thus leads to a severe downward bias in measuring cross-border online consumption [Minges2016].
A solution suggested by UNCTAD [unctad2016ecommerce] is to use business surveys instead. For measuring cross-border online consumption of consumers in a single country, large companies within the EU would have to report their sales to consumers per EU member state. This places a huge burden on companies. Moreover, the approach poses significant challenges for any correction for sampling probabilities and biases if, for example, the population of the existing ICT business survey were used. The referenced population is the result of sampling and stratification with respect to (a) economic activity and (b) either relative turnover or number of employees. The stratified sampling probabilities with respect to size in one country would have to be transformed into those of the total online sales in another country. Given large differences between countries in this regard, it seems infeasible to arrive at accurate estimates of cross-border online consumption for each EU member state using the suggested solution. This renders traditional official statistical methods inadequate.
Based on the shortcomings of traditional methods, an adequate method to measure cross-border online consumption should at least meet the following three requirements. First, the method should be based on supply-side data, to prevent a strong downward bias due to a webshop's availability in different languages. Second, it should be based on existing data that were collected to accurately measure the sales of companies across borders. A reliable administrative or other integral data source would be preferable, to prevent dealing with sampling issues. Third, the data would have to be available to national statistical institutes across the European Union in order to longitudinally monitor cross-border online consumption.
Motivated by these three requirements, we propose to use supply-side data in the form of tax returns filed by foreign companies. The EU system of value added tax (VAT) states that any company established in the EU that is involved with cross-border intra-Community supplies to consumers has to pay VAT in the country of destination, through filing a tax return (European Commission, Council Directive 2006/112/EC). The threshold value on total turnover from sales to consumers, above which filing a tax return is mandatory, is either EUR 35,000 or EUR 100,000, depending on the country of destination. For foreign companies selling to consumers in the Netherlands, the threshold value on total sales in the Netherlands equals EUR 100,000. The Dutch Tax and Customs Administration collects such tax returns, which are then made available to Statistics Netherlands. Here we note that using these data restricts us to measuring cross-border online consumption of goods (henceforth referred to as cross-border internet purchases). The approach that we will propose cannot be applied to measure cross-border online consumption of services.
The main challenge in using filed tax returns is to identify webshops. For this identification, we propose a two-phase approach. In the first phase, the aim is to select the companies that are economically active in retail trade, according to the NACE (Rev. 2) classification of industries. NACE is the acronym for Nomenclature statistique des Activités économiques dans la Communauté Européenne and is thus the "statistical classification of economic activities in the European Community". The filed tax returns do contain a classification of economic activity for each company. However, it is based on a statistical classification of industries as of 1974. As webshops did not exist in 1974, the classification is inadequate for identifying webshops. We propose a data fusion approach, merging the companies filing tax returns with a complete list of European companies active in retail trade. The list can be obtained from any business register (BR) containing the company names, country of establishment and the NACE classification of industries for every company in the EU. We choose to use the global ORBIS database (http://bvdinfo.com/orbis), maintained by Bureau van Dijk, as it is an open data source. We remark that the legal name of a company can be registered differently in the BR compared to the filed tax returns. As a result, merging legal company names from the two datasets is not straightforward due to challenges related to record linkage. Mostly, only three types of differences occur. The first type of difference results from different uses of non-alphanumeric characters, such as a period (.), a hyphen (-) or an ampersand (&). The second type of difference is the notation (abbreviated or fully written out) of the type of business entity at the end of a legal company name. An example is ltd instead of limited. The third type of difference is spelling, in particular for non-ASCII characters such as the German ü, which occurs either as ue or simply u. Other common differences in spelling are the use or omission of white spaces, such as writing web shop instead of webshop. Other types of differences, such as an updated company name that has been processed in only one of the two databases, are more complicated. We use text mining and data-driven record-linkage techniques to neutralize the first three types of differences. A significant part of the methods section describes this first phase.

A second phase is necessary, because being a foreign company active in retail trade, exporting to the Netherlands, is not equivalent to being a foreign webshop exporting to the Netherlands. We may assume that a foreign webshop that sells goods to consumers in the Netherlands will file a tax return (if its annual turnover is above the threshold value) and will be registered as a retail company in the BR, either as principal or secondary economic activity. However, a foreign company that files a tax return in the Netherlands and that is registered as a retail company in the BR might not necessarily be active as an online retailer in the Netherlands.
As a result, merging tax returns and the list of retail companies in the BR might lead to false positives. Therefore, we propose a second phase based on internet data. This phase roughly consists of two steps. The first step is to find the web page of a company based on the legal company name. The second step is to assess whether the web page belongs to a webshop, using text-based features obtained from the HTML code of the web page.
In the implementation of the two phases described above, fine-tuning of parameters is still required. In the first phase, for example, it has to be specified how similar two legal company names should be in order to be considered identical. This requires choosing an optimal threshold value for the similarity score of two legal company names. In the second phase, for example, a function has to be chosen that maps text-based features obtained from the data to a binary categorization, namely webshop or not a webshop. In both phases, the fine-tuning of parameters is achieved by using machine learning techniques. The challenge in using machine learning is choosing a suitable algorithm from the wide variety of machine learning algorithms that exist. A solution is to compare the goodness of fit of a variety of machine learning algorithms. The algorithm that performs best according to the goodness of fit on the data is chosen to fine-tune the parameters.
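As a toy illustration of such data-driven fine-tuning, the threshold on a name-similarity score can be chosen by maximizing a goodness-of-fit measure such as the F1 score on labeled examples. The sketch below uses made-up scores and labels; the paper's actual procedure and metric may differ.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, candidates):
    """Pick the similarity threshold maximizing F1 on labeled name pairs.

    scores: similarity in [0, 1] per candidate pair; labels: True if the
    pair refers to the same company. Illustrative sketch only.
    """
    best = None
    for t in candidates:
        pred = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum(l and not p for p, l in zip(pred, labels))
        score = f1(tp, fp, fn)
        if best is None or score > best[1]:
            best = (t, score)
    return best[0]
```

On toy data with two true matches scoring above 0.7 and two non-matches below, the grid search settles on the first threshold that separates them perfectly.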
The proposed approach is thus a combination of supply-side data measuring internet sales and data-driven methods to identify foreign webshops. We claim that the proposed approach yields more accurate estimates of cross-border internet purchases than existing or earlier suggested approaches. It should be noted that the proposed approach might still result in an estimate of cross-border internet purchases with a small downward bias, for which two reasons can be identified. First, companies with sales below the threshold value in the country of destination do not have to file a tax return. The internet purchases at such small companies are therefore missing in a measurement based on filed tax returns. We remark that there may exist many companies of this type. Second, the reported turnover from sales to consumers might be inaccurate, potentially leading to an underestimation of the total cross-border internet purchases. This second effect, however, is expected to be minimal, due to strict law enforcement by and collaboration between tax authorities in the EU. Whatever the cause, we do not aim to correct for either of these two biases, as there are no data available to estimate them. Moreover, our main aim is to demonstrate the severe downward bias of consumer-survey approaches compared to a supply-side approach in estimating cross-border internet purchases within the EU. We therefore do not mind if our supply-side approach still yields a conservative estimate.
The remainder of the paper is organized as follows. In Section 2, we describe the three supply-side datasets that have been used. We also describe how the training and test datasets needed for the machine learning algorithms are obtained. In Section 3, the data-driven methods to identify foreign webshops are thoroughly discussed. In Section 4, we present the results of applying the proposed approach to the economy of the Netherlands. In addition, the results are compared with results from an existing consumer-based approach to measure cross-border internet purchases in the Netherlands. Section 5 concludes by encouraging the implementation of our method in other countries within the EU and by discussing possible further research.
2 Data
In this section, we describe the three supply-side datasets that are used to measure cross-border internet purchases. In Subsection 2.1, the turnover distribution is depicted and some summary statistics are given. In Subsection 2.2, we explain how the training and validation datasets were constructed. In Subsection 2.3, we describe the test dataset.
2.1 Data Description
The data used to measure cross-border internet purchases are tax returns filed in the Netherlands by foreign companies established in the EU. This dataset of tax returns contains the legal company names and the annual turnover from sales in the Netherlands of goods taxed at the low or high tariff. The data are extracted from tax returns filed for 2014, 2015 and 2016. The dataset contains 197,424 filed tax returns from 22,440 unique companies. Naturally, this dataset of tax returns filed in the Netherlands is not openly available, due to strict privacy legislation. Under severe restrictions (among others, anonymization) and subject to strict impositions (such as suppressing extreme values), we are permitted to present aggregated figures on the data. When relevant, we reveal the criterion for which we suppressed information. Hence, in Fig. 1 we show the distribution of the annual turnover for each of the years 2014, 2015 and 2016. Furthermore, Table 1 displays summary statistics of the dataset of tax returns filed in the Netherlands.
Year  Mean  Median  10th perc  90th perc  

2014  16,023  8,969  86  6,968  1,904,860  25,987  1,755  1,365,217 
2015  17,313  9,771  80  7,462  1,603,908  25,166  1,783  1,218,926 
2016  18,939  10,626  104  8,209  1,488,351  24,790  1,879  1,143,156 
In addition, two open data sources have been used. The first open data source is the set of home pages of web sites of foreign companies that filed at least one tax return in the Netherlands for 2014, 2015 or 2016. The pages were downloaded on April 19, 2017.
The second open data source is ORBIS, a global corporate database maintained by Bureau van Dijk (http://bvdinfo.com/orbis). It contains detailed corporate information on over 200 million private companies worldwide. The database has been claimed to "suffer from some structural biases" [Ribeiro et al.2010]. However, for European companies with more than EUR 100,000 annual turnover, the dataset is close to complete [Garcia-Bernardo and Takes2017]. Data on smaller foreign companies are not needed in our analysis, as they do not have to file tax returns in the Netherlands. The ORBIS database is used because it contains the principal and secondary NACE (Rev. 2) codes for companies established in the EU. The NACE code can be used to select all active and inactive companies established in the EU that are principally or secondarily economically active in retail trade. The result is a dataset of 6,996,468 companies, in which companies established in the Netherlands are excluded. This dataset, including each company's country of establishment, was extracted from ORBIS on June 24, 2017.
For our purposes, any business register (BR) containing the company names and country of establishment of every retail company in the EU would suffice, as long as the retail companies (according to NACE Rev. 2) can be identified as such. Therefore, we will henceforth refer to the ORBIS data as “the BR”.
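The selection of foreign retail companies from the BR can be sketched as follows. The records and field names below are hypothetical (they are not the ORBIS schema); NACE Rev. 2 division 47 covers retail trade.

```python
# Hypothetical BR records; field names are assumptions, not the ORBIS schema.
br = [
    {"name": "webshop gmbh", "country": "DE", "nace": ["4791", "6201"]},
    {"name": "fabrik ag", "country": "DE", "nace": ["2511"]},
    {"name": "winkel bv", "country": "NL", "nace": ["4711"]},
]

def eu_retail_companies(records, home_country="NL"):
    """Select foreign companies with a principal or secondary NACE Rev. 2
    code in division 47 (retail trade), excluding the home country."""
    return [
        r["name"]
        for r in records
        if r["country"] != home_country
        and any(code.startswith("47") for code in r["nace"])
    ]
```

Applied to the toy records, only the German company with secondary retail code 4791 survives; the manufacturer and the Dutch retailer are filtered out.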
2.2 Training and Validation Dataset
In order to train classification algorithms, a labeled dataset is required. Since no such dataset existed, it was manually constructed as follows. The filed tax returns contain a classification of economic activity according to the (outdated) statistical classification of industries as of 1974. At first glance, most webshops seemed to be classified as Retail Trade, many as Wholesale Trade and some as another type of industry according to the outdated classification. We constructed a training dataset of 180 companies by manually categorizing all companies of which the total annual turnover in at least one year exceeded an industry-dependent threshold value (see Table 2, column 2). Table 2 (column 3) displays the number of companies manually categorized per type of industry. In fact, two manual categorizations were made for each company included in the training dataset as presented in Table 2. The first categorization is whether the company is economically active as a retail company according to the BR. We remark that the economic activity reported in the BR might differ from the one found in the tax return dataset. The second categorization is whether the company is actually a webshop or not, based on manually searching the internet.

Within the training dataset, 76 webshops were identified. Their total turnover in 2016 was equal to EUR 724,542,550. The validation dataset will be obtained from the training dataset when applying stratified $k$-fold cross-validation. It is described in more detail in Section 3.1.
Industry (1974)  Threshold  Count 

Retail Trade  EUR 1 million  100 
Wholesale Trade  EUR 20 million  30 
Other  EUR 50 million  50 
Total  –  180 
2.3 Test Dataset
To assess the goodness of fit of a classification algorithm, an additional labeled 'test' dataset is required. We constructed a test dataset by manually categorizing a set of 80 randomly selected companies not occurring in the training dataset. (One duplicate retail company had to be removed, resulting in 79 companies in the test dataset.) During the sampling, the relative frequency of the type of industry in the entire dataset was taken into account. Table 3 shows the frequencies and the total number of webshops found among the 79 companies, namely 13.
Industry (1974)  Total Frequency  Count  Webshop Count 

Retail  1,393  19  6 
Wholesale  3,329  20  1 
Other  17,718  40  6 
Total  22,440  79  13 
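The stratified draw described above can be sketched as follows. The stratum sizes and per-stratum sample counts are taken from Table 3; the integer company IDs and the seed are stand-ins for illustration.

```python
import random

# Stratum sizes from Table 3 and the per-stratum sample counts.
population = {"Retail": 1393, "Wholesale": 3329, "Other": 17718}
sample_sizes = {"Retail": 19, "Wholesale": 20, "Other": 40}

def stratified_sample(population, sample_sizes, seed=0):
    """Draw a without-replacement sample per stratum (indices stand in
    for company IDs; the seed is arbitrary)."""
    rng = random.Random(seed)
    return {
        stratum: rng.sample(range(size), sample_sizes[stratum])
        for stratum, size in population.items()
    }
```

Each stratum is sampled independently, so the per-stratum counts are met exactly and no company is drawn twice within a stratum.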
3 Methods
In this section, we discuss the data-driven methods we have used to estimate cross-border internet purchases within the EU by Dutch consumers. The methods are described in such a way that they can be generalized to other countries within the EU. We present the methods in three parts. In Section 3.1, the methods used to identify foreign webshops in the dataset of tax returns are specified. This first part, titled Estimating the Industry Class, contains the most important methodological contributions of the paper. Section 3.2, titled Bias Correction, outlines how we have corrected for biases introduced by inaccuracies in the methods used to identify foreign webshops. In Section 3.3, we briefly summarize our proposed approach for measuring cross-border internet purchases within the EU. Moreover, we motivate why we speak of a data-driven supply-side approach.
3.1 Estimating the Industry Class
The general setup is as follows. Consider a population of companies indexed by a set $I$. For a company $i \in I$, the industry class is denoted by $c_i \in C$. The set $C$ consists of only two industry classes, namely webshops ($w$) and other companies ($o$). The total turnover from sales of goods taxed at the low or high tariff as reported in the tax returns in year $y$ by company $i$ is denoted by $t_{i,y}$. In the data we have available, $t_{i,y}$ is given for each $i \in I$ and each $y \in \{2014, 2015, 2016\}$. The goal of the paper is to estimate

$$T_y = \sum_{i \in I} t_{i,y} \, \mathbf{1}[c_i = w] \qquad (1)$$

for each year $y$. In the above, $\mathbf{1}[\,\cdot\,]$ denotes the indicator function.
The challenge in estimating expression (1) is that the industry classes $c_i$, $i \in I$, are not observed and thus have to be estimated. We propose to estimate $c_i$ in two different ways and combine the two estimates into a single estimate of $c_i$. The first way to estimate $c_i$ is by business registers (BR). This is the main topic of Subsection 3.1.1. The second way to estimate $c_i$ is by web sites. This is discussed in Subsection 3.1.2. In Subsection 3.1.3, we propose how to combine the two estimates of $c_i$ into a single estimate. This final, 'combined' estimate of $c_i$ is used to evaluate (1).
3.1.1 Estimating the Industry Class by Business Registers
This subsection describes how to estimate the industry class of a company by business registers. The estimate will be denoted by $\hat{c}_i^{\mathrm{BR}}$. To estimate the industry class by business registers, we assume that a webshop can be identified by evaluating whether it is registered in the business register (BR) as a retail company (according to the NACE (Rev. 2) code) in the country in which the office filing the tax return is established. More specifically, we assume that the sales of goods taxed at the low or high tariff as reported in the tax returns by companies registered as retail companies in the BR are precisely the cross-border internet purchases within the EU by the consumers of the EU country under consideration. This assumption is used by checking for each $i \in I$ whether it is registered as a retail company in the BR. If so, $\hat{c}_i^{\mathrm{BR}} = w$; otherwise, $\hat{c}_i^{\mathrm{BR}} = o$. In other words, we aim to find the intersection of (1) the companies in $I$ and (2) the companies in the BR registered as retail companies.
The challenge is that the only available data to find the intersection are the names of the companies in each set. There are four main problems in finding the intersection using company names.

1. The type of business entity might be registered differently in both datasets. A typical example is ltd versus limited.

2. The name of a company might be spelled differently in both datasets. A typical example occurs for German names, such as muller versus mueller.

3. The datasets we use are relatively large. Finding the intersection based on (slightly differently spelled) names is computationally expensive.

4. Finding the intersection using (slightly differently spelled) names is an optimization problem: one has to determine a threshold on the permitted number of spelling differences between names belonging to the same company.
The following four paragraphs propose solutions for these four problems.
A. Stemming Company Names
The first problem we aim to solve is that the type of business entity of the same company might be registered differently in both datasets. The two datasets we refer to are the set $N_T$ of legal company names in filed tax returns and the set $N_B$ of legal company names of EU retail companies in the BR. To be able to compare the types of business entity of two company names, we first have to isolate the type of business entity from the company name. The type of business entity can be isolated by a technique called stemming, as the type of business entity is usually denoted at the end of a legal company name. Before stemming is applied, all non-alphanumeric characters in the elements of both $N_T$ and $N_B$ are replaced by white spaces. Then, all leading, trailing and duplicate white spaces are removed. Denote the resultant sets again by $N_T$ and $N_B$.
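The pre-processing step just described can be sketched in a few lines. The regular expression below is an illustrative assumption, not the exact implementation used in the paper.

```python
import re

def clean(name: str) -> str:
    """Replace non-alphanumeric characters by white space and remove
    leading, trailing and duplicate white spaces (illustrative sketch;
    the character class keeps Latin letters with diacritics)."""
    name = re.sub(r"[^0-9a-z\u00c0-\u024f]+", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()
```

For instance, periods, hyphens and ampersands all collapse into single spaces, so differently punctuated registrations of the same name become comparable.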
Next, stemming is applied to all company names in $N_T$ and $N_B$. Our three-step approach is based on the first published stemming algorithm [Lovins1968] as well as on Porter's stemming algorithm [Porter1980]. The latter is still "the most common algorithm for stemming English" [Manning et al.2009, p. 32]. In the first step, a country-dependent list of the most common end-of-string words, or suffixes, of variable length in the BR is constructed. Only the BR is used, as the dataset of filed tax returns does not contain a company's country of establishment. As the names of types of business entities differ strongly per country, the types of business entities in smaller countries might not show up as most common suffixes. It is assumed that a suffix, if present, consists of at least one and at most four words. The most common suffixes are complemented with a list of known types of business entities per EU member state, obtained from https://en.wikipedia.org/wiki/List_of_business_entities. In the second step, the obtained list of suffixes is reduced to a list of the most common starts of suffixes. This not only reduces the computation time of removing a suffix, but also increases the number of suffixes found. This step is inspired by Porter's algorithm for stemming English [Porter1980]. In the third and final step, suffixes are removed from company names in both lists by searching for any of the most common starts of suffixes. It is ensured that the remaining stems of company names are never empty.
The above defines a two-valued function $f$, yielding for any name $n$ the stem $s(n)$ of $n$ and the start $b(n)$ of the type of business entity of $n$. Define a relation $\sim$ on the image of $b$, where $b_1 \sim b_2$ if $b_1$ is an abbreviation of one of the types of business entities $b_2$ is the start of, or vice versa. The relation is reflexive and symmetric, but not necessarily transitive. The class of an element $b$ in the image of $b(\cdot)$ is defined as $[b] = \{b' : b' \sim b\}$, referred to as the suffix class of $b$. Now, the aim is to find for each $n_T \in N_T$ all $n_B \in N_B$ for which $s(n_T)$ and $s(n_B)$ are very similar and $b(n_B) \in [b(n_T)]$. That is, we determine whether $b(n_B)$ is similar to any element of $[b(n_T)]$. In other words, we are looking for company names having similar stems and identical suffix classes.
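A minimal sketch of the stemming step follows. The suffix list and abbreviation classes are tiny assumed samples, not the country-dependent lists mined from the BR.

```python
# Assumed sample lists for illustration only.
SUFFIX_STARTS = ["limited", "ltd", "gmbh", "b v"]
SUFFIX_CLASSES = {"ltd": "ltd", "limited": "ltd", "gmbh": "gmbh", "b v": "bv"}

def stem(name):
    """Return (stem, suffix class) of a cleaned company name; the stem
    is never left empty, and names without a known suffix get class None."""
    for suffix in SUFFIX_STARTS:
        if name.endswith(" " + suffix):
            remainder = name[: -len(suffix) - 1].strip()
            if remainder:  # never leave an empty stem
                return remainder, SUFFIX_CLASSES.get(suffix, suffix)
    return name, None
```

Two names with different notations of the same business entity, such as "web shop ltd" and "web shop limited", now share both a stem and a suffix class and can be matched.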
B. Approximate String Matching
The second problem is that the stems might be spelled differently in both datasets. A typical example would be web-shop ltd versus web shop ltd or webshop ltd. Another typical difference occurs in, for example, German names, such as muller versus mueller. The variety of possible differences in spelling is huge. The common solution is referred to as approximate string matching.
The idea behind approximate string matching between two sets $N_T$ and $N_B$ of strings is to determine for each element in $N_T$ the closest element in $N_B$ according to some string distance metric $d$. An example of such a string distance metric is the Jaccard distance on character $q$-grams, or shingles [Leskovec et al.2014, Chapter 3]. A character $q$-gram is defined as a substring of $q$ consecutive characters in a string. As an example, the set of character 3-grams, or trigrams, of the string 'web shop' is the set {'web', 'eb ', 'b s', ' sh', 'sho', 'hop'}. For $q \in \{1, 2, 3\}$, write $g_q$ for the function mapping a string to its set of character $q$-grams. The Jaccard distance on two sets $S_1$ and $S_2$ is defined as $d_J(S_1, S_2) = 1 - |S_1 \cap S_2| / |S_1 \cup S_2|$. Thus, the Jaccard distance on character $q$-grams between two strings $n_1$ and $n_2$ is defined as $d_{J,q}(n_1, n_2) = d_J(g_q(n_1), g_q(n_2))$. It can easily be shown that $d_{J,q}$ is a metric for every $q$. The approximate string match for an element $n_T$ in the set $N_T$ according to the metric $d$ would be (one of) the element(s) $n_B \in N_B$ minimizing $d(n_T, n_B)$. The minimum distance is commonly denoted by $d(n_T, N_B)$.
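The trigram example and the Jaccard distance above can be reproduced directly with a few lines of standard-library code:

```python
def qgrams(s, q=3):
    """Set of character q-grams: contiguous substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard_distance(s1, s2, q=3):
    """1 minus the Jaccard similarity of the two q-gram sets."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    return 1 - len(a & b) / len(a | b)
```

For 'webshop' versus 'web shop', the trigram sets share 3 of 8 distinct trigrams, giving a Jaccard distance of 1 − 3/8 = 0.625.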
Now, two issues appear. First, the best approximate string match should be categorized as either a match or no match, which means that some threshold value for $d(n_T, N_B)$ has to be determined. Second, how should one choose $q$, or, more generally, which string distance metric should one choose?
We handle both issues simultaneously by considering multiple string distance metrics and letting a machine learning algorithm determine the optimal threshold values. The string distance metrics considered are (1) normalized Levenshtein (or edit) distance, (2) Jaro–Winkler distance (originally proposed by [Winkler1990] and in fact not a metric in the mathematical sense), (3–5) Jaccard distance on sets of character 1-, 2- and 3-grams, and (6–8) cosine distance on term frequency vectors of character 1-, 2- and 3-grams. The Jaro–Winkler, Jaccard and cosine distances are defined as 1 minus the corresponding string similarity measure and always take values in the interval $[0, 1]$. The Levenshtein distance is normalized to the interval $[0, 1]$ by dividing by the maximum length of the two input strings. All metrics are defined and compared by [Cohen et al.2003] (see also [Leskovec et al.2014, pp. 87–93] for a more recent discussion).

At the end of this step, each company in the set of filed tax returns is equipped with an 8-dimensional vector containing values in the interval $[0, 1]$ quantifying the distance (along different metrics) to the set of EU retail companies in the BR. These values will be used as features in the machine learning algorithms as described in Paragraph D of Subsection 3.1.1.
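Two of the eight distance features can be sketched with the standard library only. These are illustrative implementations of the normalized Levenshtein distance and the cosine distance on trigram frequency vectors; the paper's exact implementations may differ.

```python
from collections import Counter
from math import sqrt

def levenshtein(s1, s2):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(s1, s2):
    """Edit distance divided by the maximum input length, so in [0, 1]."""
    return levenshtein(s1, s2) / max(len(s1), len(s2))

def cosine_distance(s1, s2, q=3):
    """1 minus cosine similarity of character q-gram frequency vectors."""
    f1 = Counter(s1[i:i + q] for i in range(len(s1) - q + 1))
    f2 = Counter(s2[i:i + q] for i in range(len(s2) - q + 1))
    dot = sum(f1[g] * f2[g] for g in f1)
    norm = sqrt(sum(v * v for v in f1.values())) * sqrt(sum(v * v for v in f2.values()))
    return 1 - dot / norm
```

For 'muller' versus 'mueller', one insertion over a maximum length of 7 gives a normalized Levenshtein distance of 1/7; identical strings get cosine distance 0.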
Finally, we note that the brute-force approach of computing $d(n_T, N_B)$ as the minimum of $d(n_T, n_B)$ over $n_B \in N_B$ is computationally expensive, as it requires comparing $n_T$ to each $n_B \in N_B$. Under the description of problem C below, a more efficient and more elegant approach is described, which is preferred over the brute-force approach.
C. MinHashing and LocalitySensitive Hashing
The third problem is the size of the datasets when performing approximate string matching. Each possible combination of a company name from the set of filed tax returns and a company name from the BR has to be assessed by multiple (relatively slow) approximate string matching algorithms. In our case, 22,440 × 6,996,468 ≈ 157,000,000,000 comparisons are needed to evaluate a single approximate string matching algorithm. Depending on the algorithm, this can take up to several days. As we intend to combine the results of different approximate string matching functions, we aim to reduce the computation time.
An efficient and elegant approach is to use locality-sensitive hashing (LSH). The concept of LSH is quite general; in short, it is a form of randomized dimensionality reduction. The approach we have used is outlined by Bawa et al. [bawa2005lsh], who introduce the data structure LSH Forest. It can be queried to retrieve, for any object and any natural number $m$, the approximately $m$ most similar objects in the input dataset according to any metric that induces a locality-sensitive hashing family. One such metric is the Jaccard distance between sets, as introduced above in Paragraph B of Subsection 3.1.1.

Below, we briefly describe how the LSH Forest is constructed for the Jaccard distance on character trigrams. We follow the approach outlined in [bawa2005lsh], where accompanying details, motivation and additional literature can be found.
Let $s$ be a string containing alphanumerical characters and white spaces only. Enumerate all possible character trigrams containing only letters, numbers and white spaces. Replace the set of character trigrams of $s$ by the set of corresponding character-trigram IDs, being the indexes from the enumeration. The randomized dimensionality reduction is to compute for each $s$ a $k$-bit minhash signature, as follows. First, randomly choose $k$ hash functions $h_1, \ldots, h_k$ from the family of random linear functions of the form $x \mapsto (ax + b) \bmod p$, with $a$ and $b$ integers and $p$ a fixed, large prime number. Then, randomly choose $k$ hash functions $g_1, \ldots, g_k$ mapping the values $0, \ldots, p-1$ uniformly at random onto $\{0, 1\}$. The $j$-th bit of the $k$-bit minhash signature of $s$ is then given by $g_j(\min_i h_j(v_i))$, where $v$ is the vector containing the IDs of the character trigrams of $s$.

The number $k$ is chosen as small as possible such that every string in the dataset has a unique $k$-bit signature. A maximum value of $k$ is imposed to prevent the signatures from becoming too large. The LSH Tree is defined as the logical prefix tree on all $k$-bit signatures. Note that the leaf nodes correspond to the strings $s$, and that the $k$-bit signature of $s$ specifies the path through the tree to the leaf node corresponding to $s$. The LSH Forest consists of $l$ LSH Trees, each constructed with an independently drawn random sequence of hash functions from the described family of hash functions.
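The signature construction described above can be sketched as follows. The prime, the bit-mapping functions $g_j$ and all parameters are illustrative assumptions, not the exact choices of the implementation.

```python
import random

P = 2_147_483_647  # a fixed large prime (illustrative choice)

def make_bit_hashes(k, seed=0):
    """Draw k pairs (h_j, g_j): h_j is a random linear hash mod P and
    g_j maps its values onto {0, 1} (a cheap stand-in for a uniformly
    random bit map)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P), rng.randrange(P))
              for _ in range(k)]

    def h(j, x):
        a, b, _ = params[j]
        return (a * x + b) % P

    def g(j, y):
        _, _, c = params[j]
        return ((c * y) % P) & 1  # stand-in random bit

    return h, g

def minhash_signature(trigram_ids, k=8, seed=0):
    """k-bit minhash signature: bit j is g_j applied to the minimum
    of h_j over the trigram IDs."""
    h, g = make_bit_hashes(k, seed)
    return tuple(g(j, min(h(j, v) for v in trigram_ids)) for j in range(k))
```

The signature is deterministic for a fixed seed, so equal trigram-ID sets always hash to the same leaf path in an LSH Tree.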
Given the stem $q$ of a company name, each of the $l$ LSH Trees is updated with an additional leaf node containing $q$ (the end point of the path through the LSH Tree specified by the $K$-bit signature of $q$). The $l$ LSH Trees are then searched simultaneously bottom-up, starting from the new leaf node, until the $m$ most similar items are identified.
In our analysis, the class MinHashLSHForest from the Python library datasketch is used (https://github.com/ekzhu/datasketch). The total number of hash functions is fixed at 64 and the number of LSH Trees is set to the default value $l = 8$. The datasketch implementation then fixes $K = 8$ for the length of the minhash signatures used to build each of the LSH Trees. This choice of $K$ is relatively small, but already works rather well in our case, as shown in Section 4.1. The top $m = 100$ approximately most similar leaf nodes from the LSH Forest are returned for each stem $q$. Across this set of 100 approximately most similar stems, the closest stems according to each of the 8 string distance metrics chosen in Paragraph B of Subsection 3.1.1 are computed.
D. Machine Learning
The fourth problem is choosing an optimal threshold for how different two names belonging to the same company are allowed to be. We propose to use machine learning to solve this problem.
More precisely, the aim is to find a classification algorithm that can accurately predict the industry class of company $i$ using the vector of string distances constructed in Subsection 3.1.1 so far. The true, unobserved binary value captures whether company $i$ is registered in the BR as a retail company. Recall that for each company in both the training set and the test set, this class was observed by manually searching the BR.
To select a classification algorithm and corresponding parameter settings that are optimal in predicting the industry class, the following data-driven approach was chosen. First, ten classification algorithms were selected for examination; they are depicted in Table 4. We consider the linear classification algorithms Logistic Regression (LR), Linear Discriminant Analysis (LDA) and Linear Support Vector Classification (LinSVC). The nonlinear algorithms implemented are k-Nearest Neighbours (kNN), Multinomial Naive Bayes (MNB), Quadratic Discriminant Analysis (QDA) and Support Vector Classification with Radial Basis Function Kernel (RBFSVC). Furthermore, we examine three ensemble algorithms, namely Random Forest (RF), Gradient Boosting (GB) and AdaBoost (AB). The algorithms will mostly be referred to by the acronyms in parentheses. Details on the specifications of the classification algorithms can be found in, e.g., witten2016datamining, han2011datamining or hastie2009statlearn. We have used the Python library scikit-learn (http://scikit-learn.org/, version 0.19.1) to implement the ten classification algorithms.

Type  Algorithm (Acronym)
Linear  Logistic Regression (LR)
        Linear Discriminant Analysis (LDA)
        Linear Support Vector Classification (LinSVC)
Nonlinear  k-Nearest Neighbours (kNN)
        Multinomial Naive Bayes (MNB)
        Quadratic Discriminant Analysis (QDA)
        Support Vector Classification with Radial Basis Function Kernel (RBFSVC)
Ensemble  Random Forest (RF)
        Gradient Boosting (GB)
        AdaBoost (AB)
Next, we specified for each of the ten algorithms a grid of parameter settings to be examined. These grids are depicted in Table 5; see the scikit-learn documentation for the precise parameter specifications.
Now, for each algorithm and each parameter setting in the parameter grid, stratified 5-fold cross-validation is performed on the training dataset. Cross-validation is used to prevent overfitting. The choice of 5 folds is based on breiman1992kfoldcrossval, although it might introduce more variance than choosing 10 or 20 folds [Kohavi1995]. However, due to the small size of the training dataset, choosing 10 or 20 folds might itself lead to unstable results. Therefore, we have chosen stratified 5-fold cross-validation in order to reduce the variance, as suggested by kohavi1995stratcrossval. Furthermore, we optimize parameter settings using the mean F1 score over the 5 folds. We prefer F1 over accuracy due to the low base rate of webshops in the entire dataset. We do not use the common metric AUROC to optimize parameter settings, as it is known to possibly mask poor performance when facing imbalanced data [Jeni et al.2013]; as our data are in fact strongly imbalanced, due to the low base rate of webshops, it does not seem wise to use AUROC as the optimization metric. Moreover, optimizing AUROC does not, in general, imply optimizing F1 [Davis and Goadrich2006]. Thus, for each algorithm the parameter setting that maximizes the mean F1 score over the five folds is selected. Subsequently, the means and standard deviations of the F1 scores over the five folds are compared between the ten optimal classification algorithms.
Finally, both the mean and the standard deviation of the F1 scores over the five folds are considered in choosing the final classification algorithm and corresponding parameter settings. If necessary, the local behaviour on the parameter grid is examined to reduce the standard deviation of the F1 scores over the five folds. The final classification algorithm is then trained on the entire training dataset. The trained classification algorithm is used to compute the estimated class for each company not included in the training dataset. Recall that this is an estimate of whether company $i$ is registered as a retail company in the BR; in practice, it might differ from the true industry class that we aim to estimate. This is the reason that we will introduce a second estimate in Subsection 3.1.2. The two estimates are combined into a single estimate in Subsection 3.1.3.
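The selection procedure described above can be sketched with scikit-learn's GridSearchCV. The synthetic features, grid values and seed below are illustrative stand-ins for the string-distance features and the grids of Table 5; only the use of stratified 5-fold cross-validation with F1 as the optimization metric follows the text.

```python
# Sketch of the model-selection step: stratified 5-fold cross-validation,
# optimizing the mean F1 score over a parameter grid, on imbalanced data.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 8))                      # stand-in distance features
y = (X[:, 0] + 0.5 * X[:, 1] > 1.2).astype(int)  # imbalanced binary class

grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # illustrative grid
search = GridSearchCV(
    SVC(kernel="rbf", class_weight="balanced"),
    grid,
    scoring="f1",  # preferred here over accuracy and AUROC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best_params = search.best_params_  # setting maximizing the mean F1 score
```

The standard deviation of the per-fold F1 scores for the winning setting can be read from `search.cv_results_`.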
Algorithm  Parameter Grid
LR  penalty: l1, l2; …
LDA  nonparametric
LinSVC  …
kNN  …
MNB  …
QDA  nonparametric
RBFSVC  …; …
RF  …; …
GB  …; …; …
AB  …; …; …
3.1.2 Estimating the Industry Class by Web Sites
This section describes how to estimate the industry class of a company by web sites. To estimate the industry class by web sites, we assume that a webshop can be identified by a shopping cart on its home page, referred to as such in the underlying HTML code. If a shopping cart is found, the company is classified as a webshop; otherwise, it is not.
The main challenge is that filed tax returns do not contain the URL of the web site of a company. We have therefore implemented a method for finding the URL of the web site of a company based on the legal company name. Web scraping is used to automatically look for a shopping cart on the web site. The two tasks are, however, not flawless. Therefore, an additional machine learning approach is used to minimize errors. To summarize, we thus proceed in three steps.

I. Find the URL of a company based on the legal company name.

II. Scrape the HTML code of the URL found in step I and look specifically for a shopping cart.

III. Use the results from step II in a classification algorithm in order to estimate the industry class.

The following three paragraphs describe these three steps.
I. Finding a Company’s Web Site
The first step is to find the URL of the home page of each foreign company filing tax returns. A URL might be available in filed tax returns in some countries; in tax returns filed in the Netherlands, it is not. Therefore, at Statistics Netherlands, software has been developed to find the URL of the home page of a company based on the legal company name. The software, referred to as URL-finding, uses Google's Search API. The input is a list of legal company names and the output is a list of URLs per company, ranked according to a score between 0 (definitely not a match) and 1 (definitely a match). The score is referred to as the matching probability; note, however, that it should not be interpreted as an actual probability in the statistical sense. For each company, the URL with the highest matching probability is selected. Both the URL and the corresponding matching probability are returned.
We remark that the matching probability is computed using the Random Forest algorithm. The features used are the Jaro-Winkler similarities [Winkler1990] between the legal company name and some text-based features of the web page, such as the website address, its title and its description. The training set consists of a few thousand Dutch companies from different industries and of different sizes in terms of number of employees. We explicitly note that the Dutch language of the training set is not necessarily an issue, as discussed in Section 1. The software has not been developed by the authors and is still under development.
II. Searching for a Shopping Cart
The second step is to search the identified web site for a shopping cart. For each URL returned by URL-finding, the HTML code is downloaded as a raw text file. In the raw text, the occurrences of the words for shop, cart and basket in Dutch and English and the word for shopping cart in German are counted. The full list is: winkel, wagen, mand, shop, cart, bag, basket, warenkorb. The choice of these three languages is based on the fact that most Dutch citizens speak only Dutch, English and/or German. Note that in modern information retrieval, it is more common to count the occurrences of all words found in a document (cf. Chapter 6 in manning2009IR). We have chosen not to follow this approach, as it would lead to serious dimensionality issues: the number of different words (features) would be much larger than the number of documents (companies in the training set).
At the end of this step, each company in the set of filed tax returns is equipped with a vector containing the maximum matching probability and the counts of the eight words mentioned above. These values will be used to train a machine learning algorithm, as described in the following paragraph.
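A minimal sketch of the word-counting step, assuming the HTML has already been downloaded; the HTML snippet and the plain substring matching are our own simplifications of the actual scraper.

```python
# Sketch of step II: count the eight shopping-cart words in raw HTML.
import re

CART_WORDS = ["winkel", "wagen", "mand", "shop", "cart", "bag", "basket",
              "warenkorb"]

def cart_word_counts(html):
    """Case-insensitive substring counts of the eight shopping-cart words."""
    text = html.lower()
    return [len(re.findall(re.escape(w), text)) for w in CART_WORDS]

# Illustrative snippet standing in for a downloaded home page.
html = '<a class="cart-link" href="/cart">Winkelwagen</a><div id="shop"></div>'
features = cart_word_counts(html)
# → counts for winkel, wagen, mand, shop, cart, bag, basket, warenkorb
```

In the actual pipeline, the maximum matching probability from step I is appended to this count vector.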
III. Machine Learning
The third step is to find a classification algorithm that can accurately predict the industry class, using the eight word counts and the maximum matching probability described in step II above. The true, unobserved industry class represents whether company $i$ is a webshop. We recall that for each company in both the training and test set, this class was observed by manually searching the internet.
Before training a classification algorithm, the word counts are transformed to real numbers in the interval $[0, 1]$ using (normalized) term-frequency inverse-document-frequency (TF.IDF) (see witten2016datamining, p. 314, for a definition). To prevent division by 0 in computing the IDF, a single document containing each of the eight words once is added to the data. The eight TF.IDF values and the maximum matching probability are used as features in fitting classification algorithms on the training data.
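The transformation with the extra smoothing document might be sketched as follows. The exact TF.IDF weighting and normalization used in the paper may differ in detail; this is an illustration of the division-by-zero safeguard, not the authors' implementation.

```python
# Sketch of the TF.IDF transformation of the eight word-count vectors.
# A pseudo-document containing each word once is appended so that no word
# has document frequency zero.
import math

def tfidf_transform(count_matrix):
    """count_matrix: one word-count vector per company."""
    n_words = len(count_matrix[0])
    docs = [list(r) for r in count_matrix] + [[1] * n_words]  # smoothing doc
    n_docs = len(docs)
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / df[j]) for j in range(n_words)]  # df[j] >= 1
    out = []
    for counts in count_matrix:
        row = [c * w for c, w in zip(counts, idf)]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # length-normalize
        out.append([v / norm for v in row])                # values in [0, 1]
    return out
```

With length normalization and non-negative counts, every transformed value lies in the unit interval, as required by the text.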
The machine learning approach is identical to that described in Paragraph D of Subsection 3.1.1. One additional restriction was imposed: if either the maximum matching probability fell below a fixed threshold or no HTML code could be downloaded, the company was removed from the dataset used to train, test and finally compute the estimate. For these removed companies, the estimated class was set to a separate value to be interpreted as 'missing'.
3.1.3 Constructing the Final Estimate of the Industry Class
The two selected classification algorithms, each with its optimal parameter setting, are trained on the entire training dataset. The trained models are used to compute the two class estimates on the remaining part of the dataset. Companies for which the features needed for one of the two algorithms are (partially) missing receive a prediction value to be interpreted as 'missing'. For the BR-based estimate, this happens if the tax-stem of the company has fewer than three characters; for the website-based estimate, it happens if the maximum matching probability falls below the threshold or no HTML code was downloaded. The final, single categorization is obtained by combining the two estimates as follows:
The AND-operator is computed as the minimum of the two integers. It implies that the final model categorizes a company as a webshop if and only if the company is categorized as such by both the BR-based and the website-based estimate.
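The combination rule can be sketched as follows. Representing a missing prediction by None and propagating it is our own encoding choice for the illustration, not prescribed by the paper.

```python
# Sketch of combining the two class estimates (Subsection 3.1.3).
def combine(y_br, y_ws):
    """Final webshop indicator: AND (minimum) of the two 0/1 estimates.

    None stands for a 'missing' prediction (assumption of this sketch).
    """
    if y_br is None or y_ws is None:
        return None  # at least one feature set was missing
    return min(y_br, y_ws)  # 1 only if both estimates are 1
```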
Finally, expression (1) can be evaluated using the combined categorization as an estimate of the industry class; this was the main purpose of Section 3.1. A final note is that for companies in the training set, the manual categorization is used instead of the model. The total (annual) cross-border internet purchases as in expression (1) are thus estimated by the sum of two terms: the purchases at manually categorized webshops in the training set and the estimated purchases at companies outside the training set that the model classifies as webshops:
(2) 
The modelled categorization introduces a bias in estimating expression (1) by expression (2); the bias stems from the second term of expression (2) above. The derivation of this bias and a method to correct for it are discussed in Section 3.2.
3.2 Bias Correction
As mentioned at the end of Subsection 3.1.3, the estimate in expression (2) of total cross-border internet purchases is biased as an estimator for expression (1). Section 3.2 has two goals. The first is to derive an expression for this bias and to propose a method to correct for it. The second is to estimate the standard deviation of the final estimate of cross-border internet purchases.
For both goals, we follow the approach of delden2016errors_published, from which the notation used so far is obtained. We also introduce the vector notation from their paper: we write $y_i$ for the 2-vector $(y_i, 1-y_i)^T$, where $y_i$ indicates whether company $i$ is a webshop, and consider the aggregate turnover vector $Y = \sum_i x_i y_i$, with $x_i$ the turnover of company $i$. Similarly, $\hat{Y}$ is defined based on the estimated classes $\hat{y}_i$. Expression (2) thus becomes the first component of the estimated 2-vector given by
(3) 
In the remainder of Section 3.2, only the subset of companies outside the training set is considered. Hence, any index $i$ will refer to a company not in the training set. Consequently, $\hat{Y}$ will be used to refer only to the second term on the right-hand side of equation (3), as the first term does not introduce any bias. Similarly, $Y$ will be used in the remainder of Section 3.2 to refer to the corresponding aggregate over companies outside the training set.
The approach of delden2016errors_published entails that the true classes $y_i$ are considered to be deterministic and the estimated classes $\hat{y}_i$ to be stochastic. Consider the classification-error model $P(\hat{y}_i = h \mid y_i = g) = p_{gh}$ for $g, h \in \{0, 1\}$.
We assume that the error probabilities $p_{gh}$ do not depend on $i$. This assumption might be argued to be incorrect for two reasons. First, it is more difficult to find the correct web site for a small company than for a large company; moreover, a small company that is not a webshop might not even have a web site. Second, the coverage and quality of the BR are significantly lower for smaller companies than for larger companies. Both reasons imply that the probability of a classification error (more specifically, a false negative) increases as turnover decreases. However, estimating the error probabilities for different turnover classes, as suggested by delden2016errors_published, requires a far larger training dataset, which is quite labour-intensive to construct. Moreover, it would also require some large companies not to be included in the training dataset. We do not believe that this is wise in our case, as the turnover distribution is highly skewed, as depicted in Fig. 1 in Section 2.1. Therefore, we conclude that the accuracy of the estimation relies more heavily on accurately classifying larger companies than on accurately estimating the error probabilities.

The resulting $2 \times 2$ matrix $P = (p_{gh})$ is estimated as follows. On the test dataset, the predicted classes are compared to the true classes. Denoting by TP, FP, TN and FN the number of true positives, false positives, true negatives and false negatives, respectively, the estimator for $P$ takes the form
$$\hat{P} = \begin{pmatrix} \mathrm{TP}/(\mathrm{TP}+\mathrm{FN}) & \mathrm{FN}/(\mathrm{TP}+\mathrm{FN}) \\ \mathrm{FP}/(\mathrm{FP}+\mathrm{TN}) & \mathrm{TN}/(\mathrm{FP}+\mathrm{TN}) \end{pmatrix}.$$
For ease of notation, we will write $P$ for the estimated matrix $\hat{P}$ in what follows.
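As a numeric illustration of the classification-error model, the sketch below builds the estimated error matrix from the confusion counts reported in Section 4.3 and applies the inverse correction discussed next; the aggregate turnover vector is illustrative, not the actual data.

```python
# Numeric sketch of the classification-error matrix and its inverse
# correction. Confusion counts as reported in Section 4.3; Y is illustrative.
import numpy as np

TP, FP, TN, FN = 8, 4, 62, 5
P = np.array([
    [TP / (TP + FN), FN / (TP + FN)],   # true webshops
    [FP / (FP + TN), TN / (FP + TN)],   # true non-webshops
])

# Under the error model, E[Y_hat] = P.T @ Y, so inv(P.T) @ Y_hat is
# unbiased for Y (at the cost of a larger variance).
Y = np.array([1372.0, 8000.0])            # (webshop, non-webshop) turnover
Y_hat = P.T @ Y                           # expected observed aggregate
Y_corrected = np.linalg.inv(P.T) @ Y_hat  # recovers Y up to rounding
```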
Now, the classification-error model implies $E[\hat{y}_i] = P^T y_i$ and hence $E[\hat{Y}] = P^T Y$. The bias of $\hat{Y}$ as an estimator for $Y$ thus equals $(P^T - I)Y$, where $I$ is the $2 \times 2$ identity matrix. This estimator is biased as soon as the classification algorithm makes any mistakes (meaning $P \neq I$) and $Y$ is not an eigenvector of $P^T$ corresponding to eigenvalue 1. It implies that, in almost all cases, $\hat{Y}$ is indeed a biased estimator for $Y$ as soon as the classification algorithm makes any mistakes. Denoting by $(P^T)^{-1}$ the inverse of $P^T$, it follows that $(P^T)^{-1}\hat{Y}$ is an unbiased estimator for $Y$, and its first component is an unbiased estimator of the total cross-border internet purchases. However, correcting $\hat{Y}$ with $(P^T)^{-1}$ might increase the variance of the estimator and might thus lead to low accuracy in practice. The suggested approach is as follows. Construct bootstrap estimators $\hat{Y}^*_b$, $b = 1, \dots, B$, by applying the transition matrix $P$ to the predicted classes $\hat{y}_i$. It entails that realisations of the alternative classification-error model given by
$$P(\hat{y}^*_i = h \mid \hat{y}_i = g) = p_{gh}$$
are considered, where $\hat{y}^*_i$ is the industry class corresponding to the redrawn 2-vector. It results in bootstrap replications $\hat{Y}^*_b$, $b = 1, \dots, B$, of the estimated total turnover. The bootstrap bias and variance are given by
$$\hat{B}^* = \bar{Y}^* - \hat{Y}$$
and
$$\hat{V}^* = \frac{1}{B-1} \sum_{b=1}^{B} (\hat{Y}^*_b - \bar{Y}^*)(\hat{Y}^*_b - \bar{Y}^*)^T,$$
where $\bar{Y}^* = \frac{1}{B} \sum_b \hat{Y}^*_b$ is the sample mean. In the limit $B \to \infty$, we find
$$\hat{B}^* \to (P^T - I)\hat{Y}$$
and
$$\hat{V}^* \to \sum_i x_i^2 \left(\mathrm{diag}(p_i) - p_i p_i^T\right),$$
in which $p_i = P^T \hat{y}_i$. The function diag maps a vector to a matrix with the values of the vector on the diagonal and zeros as off-diagonal elements. Now, in this limit, both bootstrap estimators are biased. The bias of $\hat{V}^*$ as an estimator for the variance of $\hat{Y}$ can be shown to be relatively small and is therefore not corrected [Van Delden et al.2015, Appendix A4]. The bias of $\hat{B}^*$ as an estimator for the bias of $\hat{Y}$ is nonzero as well. A second bootstrap estimator is therefore defined by applying the transition matrix a second time, resulting in bootstrap replications $\hat{Y}^{**}_b$, $b = 1, \dots, B$, of the estimated total turnover. Write $\hat{B}_1 = \bar{Y}^* - \hat{Y}$ and $\hat{B}_2 = \bar{Y}^{**} - \bar{Y}^*$. A suitable combination of $\hat{B}_1$ and $\hat{B}_2$ yields an unbiased estimator of the bias of $\hat{Y}$, but might in practice have a larger variance than the biased estimator $\hat{B}_1$. To obtain the most accurate results, the goal is to find the optimal value $\alpha \in [0, 1]$ such that the linear combination $\hat{B}(\alpha)$ of $\hat{B}_1$ and the unbiased combination minimizes the mean squared error of its first component as an estimator for the bias of $\hat{Y}$.
The following iterative approach is suggested by delden2016errors_published.

1. Start with an initial value of $\alpha$.

2. Compute $\hat{B}(\alpha)$ and its estimated covariance matrix.

3. Compute the updated value $\alpha'$ that minimizes the estimated mean squared error.

4. If $\alpha'$ equals $\alpha$ up to a prescribed tolerance, stop and return $\alpha'$; otherwise, set $\alpha = \alpha'$ and return to step 2.

The final estimate of $Y$ is computed by subtracting the estimated bias $\hat{B}(\alpha)$ from $\hat{Y}$.
Details of the derivation of the formulas in step 2 can be found in Appendix A3 of delden2015errors_details. Note that we do not explicitly construct the bootstrap estimators, but use the analytical limit expressions for the bootstrap bias and variance. For large $B$, both approaches yield the same result, but the latter is computationally more efficient. It should also be noted that the above iterative procedure is performed for each of the years 2014, 2015 and 2016 separately. The same estimated matrix $P$ is used for each year; the optimal value of $\alpha$ might nevertheless differ across years, as it depends on the annual turnover figures.
We conclude Section 3.2 by demonstrating how to estimate the standard deviation of the final estimate of the total cross-border internet purchases. First, observe that the estimate of the optimal value of $\alpha$ can be written explicitly in terms of the bootstrap bias and variance expressions above. It then follows that the covariance matrix of the bias estimator $\hat{B}(\alpha)$ can be written in closed form as well. The covariance matrix of each $\hat{y}_i$ can be estimated by $\mathrm{diag}(p_i) - p_i p_i^T$, in which, again, $p_i = P^T \hat{y}_i$. Finally, the standard deviation of the first component of the final estimate is estimated by the square root of the entry in position $(1,1)$ of the resulting covariance matrix. As the turnover values $x_i$ are not stochastic, this concludes the derivation of the standard deviation of the final estimate.
3.3 The Proposed DataDriven SupplySide Approach
The proposed data-driven supply-side approach for measuring cross-border internet purchases within the EU can be summarized as follows. Based on EU VAT legislation, the starting point is a dataset of tax returns filed by foreign companies established within the EU. Tax returns are supply-side data, as they contain company sales. The challenge was then to identify webshops within the dataset of tax returns. We solved this problem in two phases. In the first phase, we implemented approximate string matching techniques to merge the set of tax returns with a business register of retail companies established within the EU. The merging can be viewed as data-driven record linkage, as we optimized the performance of the approximate string matching using machine learning algorithms. In the second phase, we used web scraping in combination with (data-driven) machine learning to assess whether a company is a webshop. The outcomes of the two phases are combined to obtain a more accurate estimate of cross-border internet purchases. Moreover, we use the data to estimate the bias and standard deviation of the estimate. Thus, the proposed methods applied to the proposed data yield our data-driven supply-side approach for measuring cross-border internet purchases within the EU.
4 Results
The goal of this section is to present our main findings in applying the proposed approach to estimate cross-border internet purchases within the EU by Dutch consumers. The section is structured as follows. In Section 4.1, the results of training the classification algorithms to estimate the industry class by business register are presented. In Section 4.2, the same is presented for estimating the industry class by web sites. In Section 4.3, we present the results of estimating cross-border internet purchases by Dutch consumers; it contains the most relevant results of the paper. Finally, in Section 4.4, we compare our results to currently available estimates of cross-border internet purchases by Dutch consumers, and we interpret and discuss the differences between the estimates.
4.1 Results from Estimating the Industry Class by Business Registers
In this section, we present the main findings of estimating the industry class by business registers. Recall from Paragraph D in Subsection 3.1.1 that we compared ten different machine learning algorithms (Table 4), each evaluated over multiple parameter settings (Table 5). For each algorithm, we selected the parameter settings that are optimal in predicting the industry class, based on the mean F1 score from the stratified 5-fold cross-validation. The results for Multinomial Naive Bayes are not shown, as this algorithm assumes discrete, multinomially distributed features, while our features are distances between strings and thus continuous.
The optimal parameter settings and corresponding scoring metrics for training the specified classification algorithms to estimate are presented in Table 6. The algorithms are ranked with respect to the optimal mean F1 score over the folds in the stratified 5fold crossvalidation. The standard deviation in scores over the five folds is shown in parentheses. Note that the differences in mean goodness of fit between the algorithms are small. Furthermore, the standard deviations in scores over the folds are small.
The algorithm we choose is Support Vector Classification with Radial Basis Function Kernel (RBFSVC), with the optimal parameter values and the balanced class weighting scheme reported in Table 6. Observe that this choice not only maximizes the mean F1 score, but also the mean precision and mean recall. In particular, the algorithm does not produce false positives on the training dataset. Moreover, the local behaviour of the mean F1 score of RBFSVC, as a function of its parameters, is stable around the optimal parameter values. Details on the local behaviour can be found in Appendix A.
Algorithm  Optimal Parameters  F1  Precision  Recall
RBFSVC  …  …  …  …
GB  …  …  …  …
kNN  …  …  …  …
LinSVC  …  …  …  …
LDA  …  …  …  …
LR  …, L1-penalty  …  …  …
AB  …  …  …  …
RF  …  …  …  …
QDA  …  …  …  …
4.2 Results from Estimating the Industry Class by Web Sites
Similar to Section 4.1, the aim of this section is to present the main findings of estimating the industry class by web sites. The optimal parameter settings and corresponding scoring metrics for training the specified algorithms are presented in Table 7. Again, the algorithms are ranked by the optimal mean F1 score over the folds in the stratified 5-fold cross-validation; the standard deviations over the five folds are shown in parentheses. Observe that Multinomial Naive Bayes (MNB) and Quadratic Discriminant Analysis (QDA) perform considerably worse than the other algorithms. Moreover, observe that the linear algorithms (LR, LinSVC, LDA) perform worse than the remaining algorithms, which suggests that a linear separation of the data points in feature space does not yield the best classification on unseen data. Note that the ensemble methods (AB, GB, RF) outperform the other methods. Furthermore, note that the scores for predicting the class by web sites are much lower than those for predicting it by business registers, and that the standard deviations over the folds are relatively high. Both might be due to the fact that, in many cases, the URL-finding software does not return a URL that is likely to correspond to the input legal company name.
The algorithm we choose is Random Forest (RF), with the optimal parameter values and the balanced class weighting scheme reported in Table 7. The reason for this choice is that RF maximizes the mean precision. Moreover, the local behaviour of the F1 score of RF, as a function of the algorithm parameters, is more stable around the optimal parameters than the local behaviour for AB and GB. Details can be found in Appendix A.
Algorithm  Optimal Parameters  F1  Precision  Recall
AB  …, bal.  …  …  …
GB  …  …  …  …
RF  …, bal.  …  …  …
kNN  …  …  …  …
RBFSVC  …, bal.  …  …  …
LR  …, L1-penalty  …  …  …
LinSVC  …  …  …  …
LDA  …  …  …  …
MNB  …  …  …  …
QDA  …  …  …  …
4.3 Estimating CrossBorder Internet Purchases
This section contains the most relevant results of the paper, as it presents, in Table 8, the final estimates of crossborder internet purchases within the EU by Dutch consumers.
Recall that, to obtain the results of this section, the algorithm chosen in Section 4.1 was (re)trained on the entire training dataset. This resulted in a model qualified to predict the industry class on the remaining part of the dataset of tax returns. Similarly, a model was trained using the algorithm chosen in Section 4.2. The two models were combined into a final model as described in Subsection 3.1.3. Comparing the final model to the true, observed classes on the test dataset yields the values TP = 8, FP = 4, TN = 62 and FN = 5. It follows that
$$\hat{P} = \begin{pmatrix} 8/13 & 5/13 \\ 4/66 & 62/66 \end{pmatrix} \approx \begin{pmatrix} 0.62 & 0.38 \\ 0.06 & 0.94 \end{pmatrix}.$$
The main results of the paper are shown in Table 8. The first column of values contains the total cross-border internet purchases at the manually categorized companies in the training set; this categorization can be considered free from errors. The second column contains the additional estimated cross-border internet purchases at companies outside the training set. The third column contains the optimal values of $\alpha$ in minimizing the mean squared error of the estimated bias. Note that all optimal values of $\alpha$ are equal to 0, meaning that the increased variance dominates the decreased squared bias of the unbiased combination compared to the plain bootstrap bias estimator; this is due to the relatively high off-diagonal values in the matrix $\hat{P}$. The fourth column represents the estimated bias for the optimal value of $\alpha$; note that the bias differs strongly across the three years. The fifth column shows the final estimate of the total cross-border internet purchases, computed by subtracting the estimated bias from the sum of the first two columns (figures are rounded). The final column contains the standard deviation of the final estimate, computed as outlined at the end of Section 3.2.
In the Netherlands, total household consumption of retail goods (food and durable goods, codes 1000 through 3000) in 2016 equals EUR 87,206 million, according to Statistics Netherlands (https://opendata.cbs.nl). Statistics Netherlands does not publish the total online consumption of goods by Dutch consumers. The only currently available estimate, by Thuiswinkel.org and GfK, is based on consumer surveys; for 2016, it equals EUR 11.01 billion. It thus seems plausible that just over 12 per cent of online consumption by Dutch consumers is spent at foreign webshops established within the EU. Besides, Statistics Netherlands does publish year-on-year growth figures on online retail sales by Dutch webshops. In 2016, this year-on-year growth was 22.1 per cent, quite similar to the growth of 21.2 per cent that we find by comparing the final estimates for 2015 and 2016 in Table 8.
Reflecting on our main findings, we note that the standard deviation of the final estimate would still be too large for official statistical purposes. However, as discussed more thoroughly in Section 4.4, our main findings constitute a significant improvement over currently available alternative estimates.
Year  Training set  Model estimate  α  Est. bias  Final estimate  Std. dev.  (EUR million)

2014  405  495  0  63  837  97
2015  565  586  0  21  1,132  101
2016  725  667  0  19  1,372  110
4.4 Comparison with DemandSide Approach
In Section 1, we claimed that our data-driven supply-side approach would be more accurate than demand-side approaches for estimating cross-border internet purchases within the EU. To support this claim, we compare our results for the Netherlands to the results of the consumer-survey approach by market research institute GfK (commissioned by Thuiswinkel.org on behalf of Ecommerce Europe). In 2016, total cross-border online consumption by Dutch consumers according to GfK was EUR 637 million, of which EUR 190 million was spent in China and EUR 70 million in the United States. This implies that at most EUR 377 million was spent within the EU. Note that this figure includes online consumption of both goods and services.
Moreover, the fraction of online consumption of goods in total online consumption in 2016, as reported by GfK, was EUR 11.01 billion / EUR 20.16 billion ≈ 0.55. We assume that this proportion does not depend on the location of the webshop. As a result, cross-border online purchases of goods within the EU, according to GfK, would approximately equal 0.55 × EUR 377 million ≈ EUR 206 million in 2016.
We, however, find EUR 1,372 million for 2016, with a standard deviation of EUR 110 million. Our estimate is more than 6 times as high as that of GfK. Our results thus reveal the severe downward bias of demand-side approaches to estimating cross-border online consumption, and they motivate the implementation of our approach in other EU member states.
5 Main Conclusion and Future Work
We have proposed to use tax returns, business registers and internet data to measure cross-border internet purchases within the EU. We have implemented data-driven methods to combine these supply-side data in a computationally efficient manner. The empirical results show that the proposed approach strongly improves on existing approaches based on consumer surveys, as it overcomes the language bias described in Section 1. In particular, we find that cross-border internet purchases by Dutch consumers within the EU in 2016 equal EUR 1,372 million, whereas market research institute GfK (commissioned by Thuiswinkel.org on behalf of Ecommerce Europe) estimates them at approximately EUR 206 million based on consumer surveys. The language bias in consumer surveys thus results in a downward bias of a factor of more than 6 in cross-border internet purchases.
The proposed approach only requires foreign companies' tax returns to contain the legal company name and the turnover taxed at the high or low tariff. Due to EU VAT legislation, these data are available in every EU member state. Note that we do not require that the economic activity of a company is accurately recorded in filed tax returns. The training and test set could even be constructed without any known economic activity, by viewing all companies as belonging to the same class and following the construction described in Sections 2.2 and 2.3. Moreover, we do not assume that the URL of the home page of a company is available in filed tax returns. Furthermore, the additional data (the BR and internet data) required by the proposed approach are obtained from open data sources. Hence, we arrive at our main conclusion, viz. that the proposed approach is expected to be applicable in any other EU member state and can thus be used to estimate cross-border internet purchases within the entire EU.
Further applications of the proposed supply-side approach include revealing the structure of the cross-border online retail market in any EU member state. Our approach directly returns a list of foreign webshops and their annual cross-border internet sales to the observing country. Complemented with information on domestic webshops active on the country's online market, this can be used to analyse the market structure. A related research topic is the export of all webshops established in a single EU member state; it is related as it is the supply-side counterpart of the cross-border online consumption market within a country. It might be interesting to compare the two market structures within individual countries and to compare the market structures between countries within the EU.
Future work on measuring cross-border internet purchases within the EU might focus on improving the website-based predictions of company classifications. The empirical results show that this is the weakest part of the proposed approach, as the F1 scores for the website-based predictions are lower than those for the BR-based predictions. The main improvement could be achieved by refining the URL-finding algorithm in order to increase the recall of the predictions. We suggest enlarging the training set of the URL-finding algorithm from Dutch to European websites, using the company names and URLs registered in the BR. We consider this improvement outside the scope of the current paper, as the results of our supply-side approach already constitute a strong improvement over existing consumer-survey approaches.
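To see why recall is the lever here: F1 is the harmonic mean of precision and recall, so at fixed precision it rises quickly with recall. The scores below are hypothetical, not the paper's measured values:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical: holding precision at 0.90 while the URL-finding
# algorithm's recall improves from 0.50 to 0.80.
print(round(f1(0.90, 0.50), 2))  # → 0.64
print(round(f1(0.90, 0.80), 2))  # → 0.85
```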
To summarize, we have proposed a data-driven, supply-side approach for measuring cross-border internet purchases within the EU. Our main findings motivate the implementation of our approach in other EU member states. As our approach is based on EU VAT legislation, international comparability is guaranteed. Ultimately, this could lead to more accurate estimates of cross-border internet purchases within the entire EU.
Acknowledgements
We would like to express our sincere gratitude to Arjan van Loon for many fruitful discussions on data analysis and research methods. Moreover, we are grateful to Dick Windmeijer for developing the URLfinding software and running it on the data. We also thank Guido van den Heuvel for developing the web scraping software and running it on the data. Furthermore, we thank Myra Bissumbhar and Marja Groen for manually categorizing companies in the training and test datasets. Finally, we are grateful to Willem de Jong and Jos Erkens for helpful discussions on EU VAT legislation. The research was funded by Statistics Netherlands and the Dutch Ministry of Economic Affairs and Climate Policy.
References
 [Bawa et al. 2005] Bawa, M., Condie, T. and Ganesan, P. (2005). LSH Forest: self-tuning indexes for similarity search. In Proceedings of the 14th International Conference on World Wide Web (WWW), pp. 651–660.
 [Breiman and Spector 1992] Breiman, L. and Spector, P. (1992). Submodel Selection and Evaluation in Regression. The X-Random Case. International Statistical Review, 60, number 3, 291–319.
 [Cardona and Duch-Brown 2016] Cardona, M. and Duch-Brown, N. (2016). Delivery Costs and Cross-border e-Commerce in the EU Digital Single Market. JRC Working Papers on Digital Economy 2016-03. Joint Research Centre (Seville site).
 [Cohen et al. 2003] Cohen, W., Ravikumar, P. and Fienberg, S. (2003). A Comparison of String Metrics for Matching Names and Records. In Proceedings of the Workshop on Data Cleaning and Object Consolidation at the 9th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 73–78.
 [Davis and Goadrich 2006] Davis, J. and Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 233–240.
 [European Commission 2015] European Commission (2015). Monitoring the Digital Economy & Society 2016–2021. http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=13706. (Accessed 26 April 2018).
 [Garcia-Bernardo and Takes 2017] Garcia-Bernardo, J. and Takes, F.W. (2017). The Effects of Data Quality on the Analysis of Corporate Board Interlock Networks. Information Systems.
 [Gomez-Herrera et al. 2014] Gomez-Herrera, E., Martens, B. and Turlea, G. (2014). The drivers and impediments for cross-border e-commerce in the EU. Information Economics and Policy, 28, 83–96.
 [Han et al. 2011] Han, J., Kamber, M. and Pei, J. (2011). Data Mining: Concepts and Techniques, 3rd edn. Waltham, MA, USA: Morgan Kaufmann.
 [Hastie et al. 2009] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edn. New York: Springer.
 [Jeni et al. 2013] Jeni, L.A., Cohn, J.F. and De La Torre, F. (2013). Facing imbalanced data – Recommendations for the use of performance metrics. In Proceedings of the Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), pp. 245–251.
 [Kohavi 1995] Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1137–1145.
 [Leskovec et al. 2014] Leskovec, J., Rajaraman, A. and Ullman, J.D. (2014). Mining of Massive Datasets. Cambridge: Cambridge University Press.
 [Lovins 1968] Lovins, J.B. (1968). Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11, number 1, 22–31.
 [Manning et al. 2009] Manning, C.D., Raghavan, P. and Schütze, H. (2009). Introduction to Information Retrieval. Cambridge: Cambridge University Press.
 [Marcus and Petropoulos 2016] Marcus, J.S. and Petropoulos, G. (2016). E-commerce in Europe: parcel delivery prices in a digital single market. Bruegel Policy Contribution, number 2016/09.
 [Martikainen et al. 2015] Martikainen, E., Schmiedel, H. and Takalo, T. (2015). Convergence of European retail payments. Journal of Banking & Finance, 50, 81–91.
 [Minges 2016] Minges, M. (2016). In Search of Cross-border E-commerce Trade Data. http://unctad.org/en/PublicationsLibrary/tn_unctad_ict4d06_en.pdf. (Accessed 26 April 2018).
 [Porter 1980] Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, number 3, 130–137.
 [Ribeiro et al. 2010] Ribeiro, S.P., Menghinello, S. and De Backer, K. (2010). The OECD ORBIS database: Responding to the need for firm-level microdata in the OECD. OECD Statistics Working Papers, 2010, number 1.
 [Schu and Morschett 2017] Schu, M. and Morschett, D. (2017). Foreign market selection of online retailers – A path-dependent perspective on influence factors. International Business Review, 26, number 4, 710–723.
 [Van Delden et al. 2015] Van Delden, A., Scholtus, S. and Burger, J. (2015). Quantifying the effect of classification errors on the accuracy of mixed-source statistics. Discussion paper 2015-10. https://www.researchgate.net/publication/281450992_Quantifying_the_effect_of_classification_errors_on_the_accuracy_of_mixedsource_statistics. (Accessed 26 April 2018).
 [Van Delden et al. 2016] Van Delden, A., Scholtus, S. and Burger, J. (2016). Accuracy of Mixed-Source Statistics as Affected by Classification Errors. Journal of Official Statistics, 32, number 3, 619–642.
 [Winkler 1990] Winkler, W.E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association, 354–359.
 [Witten et al. 2017] Witten, I.H., Frank, E., Hall, M.A. and Pal, C.J. (2017). Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Cambridge, MA, USA: Morgan Kaufmann.
Appendix
Appendix A Local Behaviour around Optimal Parameters
This appendix contains additional results on the local behaviour of the mean and standard deviation of the F1 scores obtained from the 5-fold cross-validation around the optimal parameters for the optimal algorithm. The results for the two prediction targets are presented separately.
Note that in Fig. 2, the results hardly depend on the chosen class weighting scheme. Moreover, varying one parameter while fixing the other at its optimal value only minimally affects the mean F1 score over the 5 folds. The variance appears similar in each of these parameter settings as well. Thus, the mean F1 score is stable around the optimal parameter setting.
Next, we examine the local behaviour of the mean and standard deviation of the F1 scores obtained from the 5-fold cross-validation around the optimal parameters for the algorithms GB, AB and RF. From Fig. 3 it can be derived that increasing the maximum tree depth beyond its optimal value negatively impacts the goodness of fit as measured by F1. Moreover, in panels (a) and (d) the mean F1 score is much lower than in panel (c). Thus, the results for AB are not stable around the optimal maximum tree depth, nor around the optimal learning rate. The results do seem independent of the choice of class weighting scheme, as can be seen by comparing panels (b) and (c).
From Fig. 4, we may conclude that the mean F1 scores of GB are not very stable around the optimal parameter setting. In panel (a), it can be seen that increasing the maximum depth beyond its optimal value, while fixing the other parameter at its optimal value, leads to a drop in the mean F1 score. The same holds for changing the other parameter from its optimal value while fixing the maximum depth.
Finally, we present the results of RF on the training dataset in Fig. 5. For the balanced class weighting scheme, presented in panel (a), the results seem stable as the parameter value increases. Moreover, the variance is smaller than under the uniform class weighting scheme, displayed in panel (b). However, increasing the maximum depth beyond its optimal value leads to a drop in the mean F1 score.
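The stability analysis above amounts to comparing the mean and standard deviation of per-fold F1 scores across candidate parameter settings. A minimal sketch with hypothetical fold scores (not the values underlying Figs. 2–5):

```python
from statistics import mean, stdev

# Hypothetical per-fold F1 scores from a 5-fold cross-validation,
# one list per candidate setting of a single tuning parameter.
fold_scores = {
    "depth=2": [0.78, 0.80, 0.79, 0.81, 0.77],
    "depth=4": [0.84, 0.83, 0.85, 0.82, 0.84],
    "depth=8": [0.80, 0.70, 0.83, 0.68, 0.79],  # deeper trees: unstable folds
}

for setting, scores in fold_scores.items():
    print(f"{setting}: mean F1 = {mean(scores):.3f}, sd = {stdev(scores):.3f}")

# Select the setting with the highest mean F1 over the folds.
best = max(fold_scores, key=lambda s: mean(fold_scores[s]))
print("selected:", best)  # → selected: depth=4
```

A setting with a slightly lower mean but a much smaller standard deviation may be preferable in practice, which is exactly the trade-off the appendix figures visualize.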