Talk:Box Plot - Parallel Box Plot

Question!

I have a question. In the data you use to make the diagram, if a number is repeated, how do you put that on the graph? I'm confused.

--The preceding unsigned comment was added by 74.244.254.26 (talk) 23:08, 6 December 2006 (UTC).

Expansion, revision

Hey folks. When I came across this page, it was in need of some serious revision. There were grammatical errors and repeated information. I took the liberty of removing some of the repeated information and providing a detailed step-by-step construction of the boxplot.

More suggestions

Something on parallel boxplots would be nice. I am very new here, still feeling my way. I can try to do something, but it might be messy; HTML is not my forte. Plf515 23:52, 26 November 2006 (UTC)

Something should be added about "notches". --Preceding unsigned comment added by CptNautilus (talk • contribs) 17:50, 7 November 2007 (UTC)

Error?

"Histograms and probability density functions require assumptions of the statistical distribution." What about empirical histograms and pdf's? As far as I can see these do not require any a-priori assumptions on distributions?

One figure has a German legend, and yet it is in the English Wikipedia. Sakaton (talk) 21:49, 5 February 2013 (UTC)

Boxplot Figure

The current figure, Michelsonmorley-boxplot.svg, does not render correctly and could thus confuse readers. I suggest that we revert to the non-svg file R-speed_of_light_boxplot.png. Innohead 13:06, 15 February 2007 (UTC)

How to make a box plot

Ex: Let's say the data is 39, 63, 92, 41, 99, and 55.

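1. Find the median. Ex: Pair off the smallest and largest values (39 and 99, then 41 and 92) until you reach the middle. Then find the mean of the two middle values, 55 and 63. (It's 59.)

2. Find the median of the lower numbers, counting 59 as one of them. Pair off 39 and 59, which leaves 41 and 55. Median: 48.

3. Do the same with the higher numbers. Pair off 59 and 99, which leaves 63 and 92. Median: 77.5.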

4. Plot on a number line. Ex:

  35--40--45--50--55--60--65--70--75--80--85--90--95--100
  39------48-------59------------77.5--------------99

5. Draw a box around the 3 quartiles (the medians).

6. Label. Sealpiano 01:06, 15 June 2007 (UTC)
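The steps above can be sketched in Python as follows. This is only an illustration of the convention used in this post, where the averaged median (59) is counted in both halves when the number of values is even; the function name `five_number_summary` is made up for the example, and the handling of an odd count (the middle value goes into both halves) is an assumption, since the post only shows an even count. Other conventions leave the median out of the halves, which gives 41 and 92 for this data instead of 48 and 77.5.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    med = median(values)
    if n % 2:
        # Assumption: with an odd count, the middle value is in both halves.
        lower, upper = values[:half + 1], values[half:]
    else:
        # As in the post: the averaged median is added to both halves.
        lower, upper = values[:half] + [med], [med] + values[half:]
    return values[0], median(lower), med, median(upper), values[-1]


print(five_number_summary([39, 63, 92, 41, 99, 55]))
```

The five numbers returned are the ones plotted on the number line in step 4.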

Formatting error

The quartile description is wrongly formatted, with leading [[s but no closing ]]. I'd fix it, but I can't decipher what the correct sentence should be, as I don't understand it and think it may be a partial delete which has become nonsensical. Lilac Soul 08:32, 20 June 2007 (UTC)

Errors

Where did someone get the idea that Sheldon invented the box plot? Tukey invented the plot at least as early as 1970, when he was circulating the manuscript for Exploratory Data Analysis. He adapted the shape from a graphic invented by Mary Eleanor Spear in 1952 (Tufte shows a picture of her "range chart" in VDQI). Tukey modified Spear's charts in several significant ways: he used letter values instead of quartiles, he devised distribution-free quantiles for representing outliers and extreme outliers, and (with McGill) he devised confidence intervals on the median and represented them with notches.

The algorithm for producing the box plot in this article page is not Tukey's. A discussant above describes the correct algorithm. Will someone please read Tukey's EDA and correct these serious errors? As it stands, this article only contributes to the confusion over boxplot definitions and makes it seem as if anything goes. As the Frigge, Hoaglin, and Iglewicz article clearly shows, only a few statistics packages (Minitab, SYSTAT, DataDesk) get it right. 202.62.81.253 (talk) 04:15, 23 January 2008 (UTC)



In need of attention

1) The external link http://www.physics.csbsju.edu/stats/box2.html contradicts the article with regard to how outliers and suspected outliers are marked on a plot. 2) I believe the "whisker" of a box plot is the line drawn from the box, not the tick mark at the end. 3) The whole article could use some cosmetic work and in-line citations. Ajonlime (talk) 01:35, 29 January 2008 (UTC)

No, Tukey did not give 3 variations on the box plot. You are trusting a secondary source who has not read Tukey carefully and did not know Tukey when he was alive. Go ahead and "follow the links" from a source who never met Tukey, but it will take you to misinformation. Tukey actually cared about the differences you are dismissing with a shrug. Furthermore, the site is obviously being vandalized, since the Sheldon quote has turned up again. This is a minor issue and a small point, but I think it is representative of a Wikipedia problem. Wikipedians have excluded experts by requiring secondary-source attribution, and so the material that ends up in a listing like this is often wrong. You can argue all you want, but I knew Tukey and he would be amused by the content of this article. --Preceding unsigned comment added by 67.173.98.211 (talk) 00:18, 26 March 2008 (UTC)

Thanks. I'm a bit gun-shy about editing and, as you can see, not too literate on the Wiki conventions. I thought it would be better to stay in the background and goad others to think more about the box plot entry. It's fascinating to me that anyone would care to vandalize a page with a "Sheldon" comment without providing a single piece of evidence. I can't imagine the social dynamic going on here. And yes, practices do evolve. One fairly widely used statistics package put the center on the mean, the hinges at one standard deviation, and the end of the whiskers at the extreme values. They called this EDA. It completely defeated the purpose of the box plot and, worse, it led many to think this is what a box plot is. So history does matter. We don't go changing the formula for least-squares regression and call it "least-squares," even though there are more useful regression methods available today. There are better contemporary alternatives to the box plot (for the purpose Tukey had in mind), but we shouldn't change the meaning of the display to defeat his original purpose in inventing it. Thanks again for your thoughtful reply. --Preceding unsigned comment added by 67.173.98.211 (talk) 14:05, 26 March 2008 (UTC)

The violin plot is nice, but it's just a kernel density plot. Better to overlay it with a box plot. Then you get both types of information. Also, the violin plot suffers from the bandwidth estimation problem that someone mentioned in the box-plot article. A better alternative is the dot-box plot, found in Wilkinson, L (1999). Dot plots. The American Statistician, 53, 276-281. It overlays a box plot with a dot plot, so you can see all the data and also see the median, outliers, etc. It ameliorates the main deficiency of the box plot - that it can look identical for certain unimodal and multimodal datasets. I also like the Hofmann, Wickham, Kafadar letter-value box plots <http://www.stat.iastate.edu/preprint/articles/2006-10.pdf>. This paper covers a lot of the details I've been bringing up in this discussion.

I want to illustrate the specifics of our disagreement. The argument has nothing to do with the appearance of the box plot (that's the topic of the MathWorld article; Tukey drew several different kinds of box plots -- that's just a matter of surface appearance). It has to do with letter values vs. quantiles. Basically, your statement that "sampling variation is always much bigger than the variation between definitions so the latter is of very little consequence" is false. Take the simple example x = {1,5,6,7,9}. The first 3 Tukey letter values for this batch are 6 (median) and <5, 7> (hinges). The Tukey box plot for these data shows an outlier at 1, a box from 5 to 7 and one whisker from 7 to 9. Now, there are many ways to compute quartiles. Here are just a few results using different algorithms (try it with SAS or another comparable statistics package): <4, 6, 7.5>, <2, 5.5, 6.75>, <5, 6, 7>, <3, 6, 8>. Only one of these (the <5, 6, 7> based on the empirical cumulative distribution function) yields the Tukey letter values; and even that method doesn't always yield letter values. If you draw a box plot based on those estimates, it will look quite different from Tukey's. Only one of the quartile methods yields an outlier. Now, you might say that larger datasets will show a less dramatic difference. That would be generally true, but it is easy to construct counterexamples. The kind of ill-behaved data Tukey anticipated are precisely the ones that are smoked out in his box plot (as opposed to histograms and other density estimators).
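To make the x = {1,5,6,7,9} example concrete, here is a small Python sketch. It assumes the usual Tukey rule that a point is flagged when it lies more than 1.5 spreads beyond the box edges (the comment above does not state the multiplier), and the helper names `tukey_hinges` and `flagged_outliers` are made up for the illustration. The two quartile methods shown are the ones Python's standard `statistics.quantiles` offers; they are not necessarily the algorithms any particular statistics package uses.

```python
from statistics import median, quantiles


def tukey_hinges(data):
    """Medians of the lower and upper halves; for odd N the median is in both."""
    values = sorted(data)
    half = (len(values) + 1) // 2
    return median(values[:half]), median(values[-half:])


def flagged_outliers(data, low, high, k=1.5):
    """Points more than k box-widths outside the box edges (assumed k = 1.5)."""
    step = k * (high - low)
    return [v for v in sorted(data) if v < low - step or v > high + step]


x = [1, 5, 6, 7, 9]

box_edges = {"Tukey hinges": tukey_hinges(x)}
for method in ("inclusive", "exclusive"):
    q1, _, q3 = quantiles(x, n=4, method=method)
    box_edges["quantiles, " + method] = (q1, q3)

for label, (low, high) in box_edges.items():
    print(label, (low, high), "outliers:", flagged_outliers(x, low, high))
```

On this batch the hinges and the "inclusive" quartiles put the box at 5 to 7 and flag the value 1, while the "exclusive" quartiles put the box at 3 to 8 and flag nothing.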

Let me describe Tukey's letter-value algorithm, because the poster above didn't quite get it right:

1. Sort the data.

2. Label the sorted list W.

3. Compute the conventional median of W (pick middle value if N is odd, or average two middle values if N is even). Save this letter value.

4. Split W at the median into two lists, L and U. If N is odd, include the median at the end of L and at the beginning of U.

5. Recurse on steps 3-4 for L and U (labeling each as W) until there are no blocks left to split.

The algorithm is most easily programmed as a recursive function, but it is simple to do in a loop with several pointers. For the box plot, we need to recurse only once to get the hinges. Tukey computed more letter values to characterize distributions in more detail. The Hofmann, Wickham, Kafadar letter-value plot exploits this characterization.
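A minimal loop version of the algorithm described above might look like this in Python. It is a sketch under one stated assumption: at each level the recursion continues only on the outer pieces (the lower part of L and the upper part of U), which is how the median, hinges, eighths and so on are usually tabulated. The function name `letter_values` is made up for the example.

```python
from statistics import median


def letter_values(data):
    """Return [(lower, upper), ...]: median first, then hinges, eighths, ..."""
    lower = upper = sorted(data)
    pairs = []
    while True:
        # Step 3: the conventional median of each current block.
        pairs.append((median(lower), median(upper)))
        if len(lower) <= 1:
            break  # nothing left to split
        # Step 4: split at the median; for odd lengths it stays in both halves.
        half = (len(lower) + 1) // 2
        # Assumption: keep only the outer halves for the next level.
        lower, upper = lower[:half], upper[-half:]
    return pairs


print(letter_values([1, 5, 6, 7, 9]))
```

For a box plot only the first two pairs are needed: the median and the hinges.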

Now, why did Tukey do such a peculiar thing?

1. It was simple to do with a paper and pencil. That's one of the main points of EDA. Although Tukey revolutionized statistical computing, he always chose the simpler course over the more complex when he could.

2. Tukey chose actual data values as descriptors (exemplars) instead of latent, hidden, hypothetical, population (pick your word) parameters. There are exceptions, of course, but note how often he used letter values in other, more complex, analytics such as smoothing. He didn't like getting too far away from data, and this was the source of many of the controversies Tukey got into with model-oriented statisticians.

3. Letter values have a precise definition in terms of the data batch. A high-school student can understand the algorithm. If you do some research on quantiles, by contrast, you will find it a morass of different approaches. The elementary statistics book algorithm, based on linear interpolation, barely scratches the surface.

4. Letter values are robust. See Understanding Robust and Exploratory Data Analysis by David C. Hoaglin, Frederick Mosteller, John W. Tukey, John Wiley & Sons, 2000. Many quantile methods depend on restrictive assumptions on the data.

So, this is of more than historical interest. The statistics packages really do differ, sometimes substantially. That's the point of the Hoaglin et al. article cited in the references. So, if I were writing this article, I'd devote a paragraph to letter values vs. quantiles. And I'd point out the difference -- feel free to use my example. It's not that one method's right and the other wrong. It's that the quantile/letter-value distinction can have a profound effect on the appearance of the box plot -- enough to influence what one considers an outlier. You would be surprised at how much box plots differ across statistics packages. On the same data. That's because the quantile-based box plots don't always disclose the algorithm they are using to estimate quantiles. With Tukey letter-values, there's no ambiguity.

I'm going to frustrate you again by not touching the article. You are obviously an intelligent and curious editor experienced with the ways of Wiki and a good monitor of this area (I know that sounds patronizing, but I don't mean it to be). And I suspect you edit in other statistical areas as well. So, in the end, the more research you pick up on your own, the more likely the quality of these articles will improve. I am an expert in this area but I'm not likely to get involved much further. And I really should stop adding comments to this discussion, because it's taking more space than the topic deserves. Thanks for your understanding. 67.173.98.211 (talk) 16:25, 28 March 2008 (UTC)




New Question!

If the only data I have is a box plot, can I determine the SD and SE of the mean? Thanks --Preceding unsigned comment added by 128.118.7.83 (talk) 13:27, 28 March 2008 (UTC)

Actually, you can, but it's kind of pointless. The estimate isn't very good. Estimating the standard deviation from the range was used in the classic quality control literature because it was more difficult to cumulate sums of squares on a simple calculator. See Introduction to Statistical Quality Control, 5th ed., Douglas Montgomery, pages 95-6. If you use a box plot to do this, you will have to be sure there are no extreme outliers, because they will bias the estimate even more than usual. 67.173.98.211 (talk) 14:16, 28 March 2008 (UTC)
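For what it's worth, the range-based estimate from the quality control literature is usually written as sigma ≈ R / d2, where R is the range and d2 is a tabulated constant that depends on the sample size n. A rough Python sketch follows. It assumes roughly normal data, that the whisker ends really are the minimum and maximum (no points plotted as outliers), and that n is known, which a box plot alone does not tell you. The function names and the example numbers are made up for illustration.

```python
import math

# Standard control-chart d2 constants for sample sizes 2 through 10.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
      7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078}


def sd_from_range(minimum, maximum, n):
    """Estimate the standard deviation as range / d2 (assumes normal data)."""
    return (maximum - minimum) / D2[n]


def se_of_mean(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)


# Hypothetical whisker ends read off a plot of n = 5 values.
sd = sd_from_range(10.0, 14.652, 5)
print(round(sd, 3), round(se_of_mean(sd, 5), 3))
```

As the reply says, this is a crude estimate, and it gets worse if any extreme values are present.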



About terminology: boxplot or boxplots

If a figure contains several box plots, like the first figure in this article, should it be called a boxplot or boxplots? The article currently calls it a boxplot (singular). Yebbey (talk) 04:21, 27 April 2011 (UTC)



External Link to Box Plot Tutorial

Re: The link to my web site that I added to the Box Plot page and that Glrx removed. I agree that the link I posted is inconsistent with the WP:NOTHOWTO policy, which I am now familiar with. My intent was not to promote my own products; the link is a tutorial, potentially very useful, though there is promotional material on the page. JonPeltier (talk) 16:16, 17 August 2011 (UTC)



Revise box plot image?

Shouldn't it say "median" instead of "medium" (fig. 5)? -- Preceding unsigned comment added by 12.207.23.130 (talk) 02:31, 21 October 2011 (UTC)



Figure 3

Shouldn't the whiskers in Figure 3 be of equal length? -- Preceding unsigned comment added by Raywood (talk • contribs) 21:23, 2 September 2012 (UTC)


