I have a table with a column 'search_text' type text
.
In that field I have values:
1. 'MyBook MyBook PDF PDF',
2. 'Example 1 Example 2 Example 3'
3. 'John Snow John Snow'
I would like to distinct clean these fields.
Expected result:
1. 'MyBook PDF',
2. 'Example 1 2 3'
3. 'John Snow'
The approach I came up with goes as follows: read the field for each record, split it by space (' '), put each text in array, do array_unique
in PHP, then put the array back to string with join
in PHP.
The thing is, this is a PHP based solution, I would like to have an MySQL solution for this. I have over 180.000 records I need to clean, I don't know what impact it would have to run this on PHP.
I have found a solution for MS SQL: Remove duplicate values in a cell SQL Server
Help greatly appreciated.
SQL of my test data:
CREATE TABLE IF NOT EXISTS `test` (
`id` int(10) unsigned NOT NULL,
`search_text` text COLLATE utf8_unicode_ci NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO `test` (`id`, `search_text`) VALUES
(1, 'MyBook MyBook PDF PDF'),
(2, 'Example 1 Example 2 Example 3'),
(3, 'John Snow John Snow'),
(4, 'test test test test formula test test test formula test test test formula test test test formula test test test formula test test test formula '),
(5, '');
ALTER TABLE `test`
ADD PRIMARY KEY (`id`);
ALTER TABLE `test`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=6;
I went for the PHP solution here:
$s = 'John Snow John Snow';
//remove duplicate values in string
$tmpArray = explode(" ", $s);
$tmpArray = array_unique($tmpArray);
$s = join(" ", $tmpArray);
Which is run before INSERT
, and it does what I wanted.
Try this to sort by count :)
SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(test.search_text, ' ', numbers.n), ' ', - 1) col_name
FROM (
SELECT 1 n
UNION ALL
SELECT 2
UNION ALL
SELECT 3
UNION ALL
SELECT 4
) numbers
INNER JOIN test ON CHAR_LENGTH(test.search_text) - CHAR_LENGTH(REPLACE(test.search_text, ' ', '')) >= numbers.n - 1
ORDER BY col_name;
You will need to write a MySQL function to do this for you. I would think that a PHP page will be just fine. 180,000 records isn't that many and it should (unless you are using a low spec server) run without putting much strain on anything else.
I wrote 2 for you that you might be able to make use of:
DROP PROCEDURE IF EXISTS explode;
DELIMITER //
CREATE PROCEDURE explode(str_string TEXT)
NOT DETERMINISTIC
BEGIN
DROP TABLE IF EXISTS explosion;
CREATE TABLE explosion (id INT AUTO_INCREMENT PRIMARY KEY NOT NULL, word VARCHAR(100));
SET @sql := CONCAT('INSERT INTO explosion (word) VALUES (', REPLACE(QUOTE(str_string), " ", '\'), (\''), ')');
PREPARE myStmt FROM @sql;
EXECUTE myStmt;
END //
DELIMITER ;
This procedure creates an "explode" function for use in MySQL. It uses a temporary table and explodes the words, separated by spaces into it
Then this function will read that table in, and put them into another temporary table with the duplicates removed:
DROP PROCEDURE IF EXISTS removeDuplicates;
DELIMITER //
CREATE PROCEDURE removeDuplicates(str TEXT)
BEGIN
DECLARE temp_word TEXT;
DECLARE last_word TEXT DEFAULT "";
DECLARE result TEXT;
DECLARE finished INT DEFAULT false;
DECLARE words_cursor CURSOR FOR
SELECT word FROM explosion;
DECLARE CONTINUE handler FOR NOT found
SET finished = true;
CALL explode(str);
DROP TABLE IF EXISTS temp_words;
CREATE TABLE temp_words (id INT AUTO_INCREMENT PRIMARY KEY NOT NULL, t VARCHAR(100));
OPEN words_cursor;
loop_words: LOOP
FETCH words_cursor INTO temp_word;
IF finished THEN
LEAVE loop_words;
END IF;
IF last_word = "" THEN
INSERT INTO temp_words (t) VALUES (temp_word);
SET last_word = temp_word;
ITERATE loop_words;
END IF;
IF last_word = temp_word THEN
SET last_word = temp_word;
ITERATE loop_words;
END IF;
INSERT INTO temp_words (t) VALUES (temp_word);
END LOOP loop_words;
CLOSE words_cursor;
END //
DELIMITER ;
So all you need to do is work out how to get the records in temp_words
into your current database table.